[Sparse Bug] Test and sparse_remote_update cannot co-exist, crash trainer if necessary #891

Merged
reyoung merged 4 commits into PaddlePaddle:develop from backyes:sparse_bug
Dec 16, 2016

Conversation

@backyes
Contributor

@backyes backyes commented Dec 14, 2016

fix #660

The new interface no longer allows disabling the test dataprovider, so this patch still needs a few more changes.

Related issue:

predict and sparse_remote_updater cannot coexist either; this patch does not address that prediction problem for now.

@backyes
Contributor Author

backyes commented Dec 15, 2016

This patch detects abnormal sparse model configurations early, preventing problem #660 from occurring.

How it works:

  • Under the sparse cluster mode, there is a prefetch step: based on the data in the current batch, it fetches only the relevant training parameters from the pserver, optimizing communication and overall update efficiency. However, the test pass is forward-only, with no backward or update step. Because the current design lets test and trainer share one set of logic (to reduce GPU memory consumption), the test forward shares part of the prefetch logic and data, so a cluster sparse-training misconfiguration is only caught once the forward pass runs.

This patch checks whether the model is sparse and reports the error up front.

In addition, some users run prediction through the test logic, which carries the same latent risk; this patch guards against such cases as well. Prediction through py_paddle, however, cannot be caught by this patch.

@backyes
Contributor Author

backyes commented Dec 15, 2016

A further takeaway from debugging this patch: at the algorithm level, the test, predict, and train phases should be separated in the top-level code as much as possible, with the low-level modules kept single-purpose and simple, and the top level sharing low-level logic by composition; otherwise problems become hard to reason about. The current sparse design, especially the refactored sparse logic, couples the ids into the low-level communication protobuf logic in order to optimize the separate paths. This over-couples the algorithm with communication: some top-level misconfigurations surface directly in the lowest-level logic, which makes problems relatively hard to locate and understand.

@reyoung
Collaborator

reyoung commented Dec 15, 2016

Calling LOG(FATAL) directly doesn't seem great. Couldn't we change the trainer to support this?

@backyes
Contributor Author

backyes commented Dec 15, 2016

@reyoung

  • The root cause of the error lies inside the gradient machine, so it seems hard to fix from the trainer's top level.

  • If we only warn and continue:
    * automatically rewriting the sparse config (e.g. disabling it) would have a large impact, making performance much slower;
    * automatically disabling the test pass would diverge far from the interface as users understand it.

  • Going straight to FATAL actually feels acceptable; only users with a remote sparse configuration are affected.

Even if we do change it, I think that should be handled in a separate PR.

LOG(FATAL) << "It's prohibited to set sparse_remote_update "
<< "in some layers if testing will be under going "
<< "in the middle of training. You can do testing "
<< "within separate process.";
Collaborator

It's prohibited to set sparse_remote_update when doing train and test jobs in the same process. You could run paddle --job=test in a separate process.

Collaborator

You could run the message through Grammarly to check the grammar. Also, if we report the error here, will it also fire when testing runs in a different process?

Contributor Author

The test pass is required not to use a sparse configuration, so an error during test means the configuration is wrong. This approach should be fine.

Contributor Author

It's prohibited to set sparse_remote_update when doing train and test jobs in the same process. You could run paddle --job=test in a separate process.

Followed your comments.

@reyoung reyoung merged commit 04fb1fc into PaddlePaddle:develop Dec 16, 2016
zhhsplendid pushed a commit to zhhsplendid/Paddle that referenced this pull request Sep 25, 2019
wangxicoding pushed a commit to wangxicoding/Paddle that referenced this pull request Dec 9, 2021
zhangyuqin1998 pushed a commit to zhangyuqin1998/Paddle that referenced this pull request Feb 20, 2025
