[Sparse Bug] Test and sparse_remote_update cannot coexist; crash trainer if necessary #891
Conversation
This patch detects misconfigured sparse models early, avoiding problems like #660. How it works:
The patch checks whether the model is sparse and raises an error up front. It also covers the potentially risky case where some users run prediction through the test logic. However, the case where users predict via py_paddle cannot be caught by this patch.
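A minimal sketch of the kind of early check this patch describes, assuming hypothetical names (`ParameterConfig`, `checkSparseRemoteUpdate`) and glog-style macros; the real check lives in the Paddle trainer code and differs in detail:

```cpp
#include <glog/logging.h>
#include <string>
#include <vector>

// Hypothetical stand-in for a parameter's configuration.
struct ParameterConfig {
  std::string name;
  bool sparse_remote_update = false;
};

// Abort early if any parameter enables sparse_remote_update while the
// trainer is also configured to run tests in the same process, instead
// of letting the misconfiguration surface deep in the communication layer.
void checkSparseRemoteUpdate(const std::vector<ParameterConfig>& params,
                             bool testingDuringTraining) {
  if (!testingDuringTraining) return;
  for (const auto& p : params) {
    CHECK(!p.sparse_remote_update)
        << "It's prohibited to set sparse_remote_update when doing "
        << "train and test jobs in the same process. You could run "
        << "paddle --job=test in a separate process. (parameter: "
        << p.name << ")";
  }
}
```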
Another takeaway from tracking down this bug: the test, predict, and train stages should be separated in the top-level code wherever possible, with the low-level modules kept single-purpose and simple, and the top level sharing the low-level logic by composition; otherwise problems become hard to understand. The current sparse design, and especially the refactored sparse logic, couples the ids into the low-level communication protobuf logic in order to optimize the separated code paths. That has brought its own problems: the algorithm and the communication layer are too tightly coupled, and some top-level configuration errors surface directly in the lowest-level logic, which makes problems relatively hard to locate and understand.
Calling LOG(FATAL) directly doesn't seem great. Couldn't we modify the trainer to support this instead?
Even if we change that, I think it should be addressed in a separate PR.
paddle/trainer/Tester.cpp (outdated diff)
LOG(FATAL) << "It's prohibited to set sparse_remote_update "
           << "in some layers if testing will be under going "
           << "in the middle of training. You can do testing "
           << "within separate process.";
It's prohibited to set sparse_remote_update when doing train and test jobs in the same process. You could run paddle --job=test in a separate process.
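For reference, a sketch of the quoted block with this suggested wording applied (not necessarily the exact merged code):

```cpp
LOG(FATAL) << "It's prohibited to set sparse_remote_update "
           << "when doing train and test jobs in the same "
           << "process. You could run paddle --job=test in "
           << "a separate process.";
```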
You could run the wording through Grammarly. Also, if we report the error here, would it also fire when train and test run in different processes?
The test process is not allowed to use a sparse configuration, so an error during test always means a misconfiguration. This change should therefore be fine.
It's prohibited to set sparse_remote_update when doing train and test jobs in the same process. You could run paddle --job=test in a separate process.
Followed your comments.
fix #660
The new interface no longer allows disabling the test DataProvider, so this patch still needs some changes.
Related issue:
predict and sparse_remote_updater also cannot coexist; this patch does not address that prediction problem for now.