[Sparse Bug] Test and sparse_remote_update cannot co-exist, crash trainer if necessary #891

Merged
reyoung merged 4 commits into PaddlePaddle:develop from backyes:sparse_bug
Dec 16, 2016

Conversation

@backyes
Contributor

@backyes backyes commented Dec 14, 2016

fix #660

The new interface no longer allows disabling the test dataprovider, so this patch still needs a few more changes.

Related issue:

predict and sparse_remote_updater cannot coexist either; this patch does not address that prediction problem for now.

@backyes
Contributor Author

backyes commented Dec 15, 2016

This patch detects abnormal sparse model configurations early, preventing problem #660 from occurring.

How it works:

  • Under the sparse cluster mode, there is a prefetch step: based on the data in the current batch, it fetches only the relevant training parameters from the pserver, optimizing communication and overall update efficiency. However, the test pass is forward-only, with no backward or update step. Because the current design lets test and trainer share one set of logic (to reduce GPU memory consumption), the test forward shares part of the prefetch logic and data, so a cluster sparse-training misconfiguration is only caught once the forward pass runs.

This patch checks whether the model is sparse and reports the error up front.

In addition, some users run prediction through the test logic, which carries the same latent risk; this patch guards against such cases as well. Prediction through py_paddle, however, cannot be caught by this patch.

@backyes
Contributor Author

backyes commented Dec 15, 2016

A further takeaway from debugging this patch: at the algorithm level, the test, predict, and train phases should be separated in the top-level code as much as possible, with the low-level modules kept single-purpose and simple, and the top level sharing low-level logic by composition; otherwise problems become hard to reason about. The current sparse design, especially the refactored sparse logic, couples the ids into the low-level communication protobuf logic in order to optimize the separate paths. This over-couples the algorithm with communication: some top-level misconfigurations surface directly in the lowest-level logic, which makes problems relatively hard to locate and understand.

@reyoung
Collaborator

reyoung commented Dec 15, 2016

Calling LOG(FATAL) directly doesn't seem great. Couldn't we change the trainer to support this?

@backyes
Contributor Author

backyes commented Dec 15, 2016

@reyoung

  • The root cause of the error lies inside the gradient machine, so it seems hard to fix from the trainer's top level.

  • If we only warn and continue:
    * automatically rewriting the sparse config (e.g. disabling it) would have a large impact, making performance much slower;
    * automatically disabling the test pass would diverge far from the interface as users understand it.

  • Going straight to FATAL actually feels acceptable; only users with a remote sparse configuration are affected.

Even if we do change it, I think that should be handled in a separate PR.

LOG(FATAL) << "It's prohibited to set sparse_remote_update "
<< "in some layers if testing will be under going "
<< "in the middle of training. You can do testing "
<< "within separate process.";
Collaborator

It's prohibited to set sparse_remote_update when doing train and test jobs in the same process. You could run paddle --job=test in a separate process.

Collaborator

You could run the message through Grammarly to check the grammar. Also, if we report the error here, will it also fire when testing runs in a different process?

Contributor Author

The test pass is required not to use a sparse configuration, so an error during test means the configuration is wrong. This approach should be fine.

Contributor Author

It's prohibited to set sparse_remote_update when doing train and test jobs in the same process. You could run paddle --job=test in a separate process.

Followed your comments.

@reyoung reyoung merged commit 04fb1fc into PaddlePaddle:develop Dec 16, 2016
zhhsplendid pushed a commit to zhhsplendid/Paddle that referenced this pull request Sep 25, 2019
wangxicoding pushed a commit to wangxicoding/Paddle that referenced this pull request Dec 9, 2021
zhangyuqin1998 pushed a commit to zhangyuqin1998/Paddle that referenced this pull request Feb 20, 2025
