
[In Progress] Fix bug: enable sparse weight setting in trainer_config_helper APIs #985

Closed
backyes wants to merge 1 commit into PaddlePaddle:develop from backyes:fix_sparse_weight

Conversation

@backyes
Contributor

@backyes backyes commented Dec 21, 2016

fix #948

Why this PR was opened:

  • The main goal is to use this PR and the linked issue to sort out the sparse-related problems and document them, so that other developers and users can track them.
  • At the same time, it tries to fix some sparse weight bugs (as distinct from the sparse-update strategy).

Bug:

  • The new API does not yet support configuring sparse weight training.

Open questions for completing sparse support in the new API:

  • How should the default value of nnz be set, and on what basis?
Layer(
    name = "layer1_5",
    type = "fc",
    size = 3,
    active_type = "tanh",
    inputs = Input("input",
              learning_rate=0.01,
              momentum=0.9,
              decay_rate=0.05,
              initial_mean=0.0,
              initial_std=0.01,
              format = "csc",
              nnz = 4)
)

Looking at the old icode git history (commit af92dcde6afc4454354089e47870c7ef38dfeda3): where did the nnz=4 value in the configuration above come from?

  • Can the csr sparse format simply be the default? In theory the sparse storage of a parameter weight is an internal format with no relation to the data source, so I suggest not exposing the format parameter from the old API to users. (There should be no performance difference between csr and csc in computation? @reyoung could comment on this point.)
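To make the format question concrete, here is a minimal pure-Python sketch (hypothetical helpers, not Paddle code) of how the same weight matrix is laid out in CSR versus CSC. Both formats store exactly the same nnz non-zero values; only the traversal order differs, which is why the choice reads as an internal storage detail rather than something tied to the data source.

```python
# Minimal sketch of CSR vs CSC storage (hypothetical helpers, not Paddle code).
# Both formats keep the same nnz values; only the compression axis differs.

def to_csr(dense):
    """Compress by rows: values, column indices, and row offsets."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def to_csc(dense):
    """Compress by columns: values, row indices, and column offsets."""
    values, row_idx, col_ptr = [], [], [0]
    for j in range(len(dense[0])):
        for i, row in enumerate(dense):
            if row[j] != 0:
                values.append(row[j])
                row_idx.append(i)
        col_ptr.append(len(values))
    return values, row_idx, col_ptr

w = [[1, 0, 2],
     [0, 0, 3],
     [4, 0, 0]]  # nnz = 4

print(to_csr(w))  # ([1, 2, 3, 4], [0, 2, 2, 0], [0, 2, 3, 4])
print(to_csc(w))  # ([1, 4, 2, 3], [0, 2, 0, 1], [0, 2, 2, 4])
```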

  • In the old API, only FCLayer and SelectiveFCLayer support the sparse weight configuration; no other layer does. So, should this parameter exist as a general parameter attribute, or should it be implemented as a layer-specific property? (One of the design goals of the new API is a simpler interface that users can understand, so we should follow that principle here.)

Beyond fixing the API, some further questions:

  • In theory, marking a parameter weight as sparse generally cannot further reduce forward and backward computation time during training. With sparse input data configured, the forward pass should already use sparse computation (whether the backward pass does still needs confirmation), and with sparse update and sparse remote update enabled on top of that, separately making the parameter weight sparse should yield no performance gain?

  • If a sparse parameter weight is meant to produce a sparse model for inference, then it should be enough to store the weights sparsely only at model-save time, with no need for this sparsification during training computation?

So what is the value of this feature?
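The save-time-only alternative in the last bullet can be sketched as follows (hypothetical helper, not a Paddle API): training keeps a dense weight buffer, and only serialization converts each row to an (indices, values) sparse encoding.

```python
# Hypothetical sketch of save-time sparsification: training stays dense;
# only the serialization step stores the non-zero entries.

def sparsify_for_save(dense_row, eps=1e-8):
    """Return (indices, values) for entries whose magnitude exceeds eps."""
    pairs = [(j, v) for j, v in enumerate(dense_row) if abs(v) > eps]
    indices = [j for j, _ in pairs]
    values = [v for _, v in pairs]
    return indices, values

row = [0.0, 0.7, 0.0, -0.2, 0.0]
idx, vals = sparsify_for_save(row)
print(idx, vals)  # [1, 3] [0.7, -0.2]
```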

@backyes
Contributor Author

backyes commented Dec 21, 2016

@backyes
Contributor Author

backyes commented Dec 22, 2016

Potential bug update (SHA1: 28c5010):

  • If sparse_update and sparse weight are configured together, and sparse training data is fed in, the following is reported:
void GpuMatrix::mul(const Matrix& a,
                    const Matrix& b,
                    real scaleAB,
                    real scaleT) {
  const auto a_ptr = dynamic_cast<const GpuMatrix*>(&a);
  const auto b_ptr = dynamic_cast<const GpuMatrix*>(&b);
  const auto a_ptr_s = dynamic_cast<const GpuSparseMatrix*>(&a);
  const auto b_ptr_s = dynamic_cast<const GpuSparseMatrix*>(&b);

  if (a_ptr && b_ptr) {
    mul(*a_ptr, *b_ptr, scaleAB, scaleT);
  } else if (a_ptr_s && b_ptr) {
    mul(*a_ptr_s, *b_ptr, scaleAB, scaleT);
  } else if (a_ptr && b_ptr_s) {
    mul(*a_ptr, *b_ptr_s, scaleAB, scaleT);
  } else {
    LOG(FATAL) << "Not supported";
  }
}

The last branch is the one that goes wrong.
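The dispatch above only handles (dense, dense), (sparse, dense), and (dense, sparse); every remaining combination falls through to the fatal branch. A hypothetical Python analogue of the same dispatch pattern (illustration only, not Paddle code):

```python
# Hypothetical Python analogue of the GpuMatrix::mul dispatch above.
# Only three operand-type combinations are handled; everything else,
# notably sparse x sparse, falls through to the "Not supported" branch.

class Dense:
    pass

class Sparse:
    pass

def mul(a, b):
    if isinstance(a, Dense) and isinstance(b, Dense):
        return "dense x dense"
    if isinstance(a, Sparse) and isinstance(b, Dense):
        return "sparse x dense"
    if isinstance(a, Dense) and isinstance(b, Sparse):
        return "dense x sparse"
    # Mirrors LOG(FATAL) << "Not supported" in the C++ code.
    raise RuntimeError("Not supported")

print(mul(Dense(), Sparse()))   # dense x sparse
try:
    mul(Sparse(), Sparse())
except RuntimeError as e:
    print(e)                    # Not supported
```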

  • If sparse training data and sparse weight are enabled, sparse_updater is disabled, and trainer_count=1, the system crashes at:
27	  return momentum_;
(gdb) bt
#0  0x0000000000dfacaa in paddle::ParameterConfig::momentum (this=0x0)
    at /home/wangyanfei/paddle_internal_release_tools/idl/paddle/build/proto/ParameterConfig.pb.h:727
#1  0x0000000000e0ed86 in paddle::SparseMomentumParameterOptimizer::init (this=0x22403d0, numRows=6,
    config=0x0)
    at /home/wangyanfei/paddle_internal_release_tools/idl/paddle/Paddle/paddle/parameter/FirstOrderOptimizer.cpp:44
#2  0x0000000000c3eca3 in paddle::SgdLocalUpdater::init (this=0x2240330, parameters=...)
    at /home/wangyanfei/paddle_internal_release_tools/idl/paddle/Paddle/paddle/trainer/ParameterUpdater.h:69
#3  0x0000000000c35956 in paddle::Trainer::init (this=0x7fffffffd650, config=..., testing=false,
    gradientMachine=..., dataProvider=..., testDataProvider=...)
    at /home/wangyanfei/paddle_internal_release_tools/idl/paddle/Paddle/paddle/trainer/Trainer.cpp:245
#4  0x0000000000a4ea26 in main (argc=11, argv=0x7fffffffda18)
    at /home/wangyanfei/paddle_internal_release_tools/idl/paddle/Paddle/paddle/trainer/TrainerMain.cpp:96
(gdb) f 2
#2  0x0000000000c3eca3 in paddle::SgdLocalUpdater::init (this=0x2240330, parameters=...)
    at /home/wangyanfei/paddle_internal_release_tools/idl/paddle/Paddle/paddle/trainer/ParameterUpdater.h:69
69	    optimizer_->init(parameters_.size(), nullptr);
(gdb)

The reason is that trainer_count=1 (more precisely, disabling the sparse updater) enables the local updater, which only supports global optimization strategies and cannot support per-parameter ones. (This is similar to the L1-regularization problem.)

This also indirectly shows that sparse momentum cannot coexist with SgdLocalUpdater.
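The crash mechanics in the stack trace can be sketched in Python (hypothetical classes mirroring the trace, not Paddle code): SgdLocalUpdater passes nullptr as the per-parameter config to its optimizer's init(), which a global optimizer never touches, but SparseMomentumParameterOptimizer dereferences it to read the momentum.

```python
# Hypothetical sketch of the init() mismatch behind the backtrace:
# a global optimizer ignores the per-parameter config, while the
# sparse-momentum optimizer dereferences it -- here it is None,
# mirroring optimizer_->init(parameters_.size(), nullptr).

class GlobalSgdOptimizer:
    def init(self, num_rows, config):
        # Global strategy: the per-parameter config is never touched.
        return "ok"

class SparseMomentumOptimizer:
    def init(self, num_rows, config):
        # Per-parameter strategy: reads the config unconditionally.
        return config.momentum  # AttributeError when config is None

print(GlobalSgdOptimizer().init(6, None))  # ok
try:
    SparseMomentumOptimizer().init(6, None)
except AttributeError:
    print("crash: null per-parameter config dereferenced")
```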

  • If sparse training data and sparse weight are enabled, sparse_updater is disabled, and trainer_count > 1, the system crashes:
INFO 2016-12-22 23:14:12,307 networks.py:1472] The output order is [__cost_0__]
I1222 23:14:12.309047  6331 Trainer.cpp:176] trainer mode: Normal
*** Aborted at 1482419652 (unix time) try "date -d @1482419652" if you are using GNU date ***
PC: @           0xab88a4 paddle::VectorT<>::getSize()
*** SIGSEGV (@0x30) received by PID 6331 (TID 0x7f9f1dfd9780) from PID 48; stack trace: ***
    @     0x7f9f1dbb3160 (unknown)
    @           0xab88a4 paddle::VectorT<>::getSize()
    @           0xdf6e68 paddle::Parameter::setMat()
    @           0xbc5f54 paddle::Parameter::enableType()
    @           0xbd38e9 paddle::parameterInitNN()
    @           0xbcf57a _ZNSt5_BindIFPFviPN6paddle9ParameterEPSt6vectorISt10shared_ptrIS1_ESaIS5_EEESt12_PlaceholderILi1EESB_ILi2EES8_EE6__callIvJOiOS2_EJLm0ELm1ELm2EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
    @           0xbcd582 _ZNSt5_BindIFPFviPN6paddle9ParameterEPSt6vectorISt10shared_ptrIS1_ESaIS5_EEESt12_PlaceholderILi1EESB_ILi2EES8_EEclIJiS2_EvEET0_DpOT_
    @           0xbcac24 std::_Function_handler<>::_M_invoke()
    @           0xbd7d32 std::function<>::operator()()
    @           0xbd412b paddle::NeuralNetwork::init()
    @           0xbbef84 paddle::TrainerThread::TrainerThread()
    @           0xbbcf25 paddle::MultiGradientMachine::MultiGradientMachine()
    @           0xbe18bf paddle::GradientMachine::create()
    @           0xc3b80c paddle::TrainerInternal::init()
    @           0xc3526b paddle::Trainer::init()
    @           0xa4ea26 main
    @     0x7f9f1c9d8bd5 __libc_start_main
    @           0xa4df29 (unknown)
local.sh: line 15:  6331 Segmentation fault      (core dumped) PYTHONPATH=./:../../../python ../../../../build/paddle/trainer/paddle_trainer --use_gpu=0 --config=./sparse_trainer_config.py --saving_period=1 --test_period=0 --num_passes=4 --dot_period=2 --log_period=20 --trainer_count=2 --saving_period_by_batches=5000 --local=1

The crash happens in multi-GPU gradient machine initialization.

  • If sparse training data, sparse weight, and sparse_updater are all enabled, and trainer_count > 1, the system crashes:
1222 23:16:26.934692 18161 Trainer.cpp:125] ignore sparse_remote_update=true due to  --local=true
I1222 23:16:26.934728 18161 Trainer.cpp:173] trainer mode: SgdSparseCpuTraining
F1222 23:16:26.973275 18161 Parameter.cpp:219] Check failed: height * width == bufs_[pType]->getSize() (290916736 vs. 4)
*** Check failure stack trace: ***
    @           0xf2aba4  google::LogMessage::Fail()
    @           0xf2aafc  google::LogMessage::SendToLog()
    @           0xf2a591  google::LogMessage::Flush()
    @           0xf2d352  google::LogMessageFatal::~LogMessageFatal()
    @           0xdf781d  paddle::Parameter::setMat()
    @           0xbc5f54  paddle::Parameter::enableType()
    @           0xbbc6fe  _ZZN6paddle20MultiGradientMachineC1ERKNS_11ModelConfigEbENKUliPNS_9ParameterEE_clEiS5_
    @           0xbc1ef1  _ZNSt17_Function_handlerIFviPN6paddle9ParameterEEZNS0_20MultiGradientMachineC1ERKNS0_11ModelConfigEbEUliS2_E_E9_M_invokeERKSt9_Any_dataiS2_
    @           0xbd7d32  std::function<>::operator()()
    @           0xbd412b  paddle::NeuralNetwork::init()
    @           0xbbca86  paddle::MultiGradientMachine::MultiGradientMachine()
    @           0xbe18bf  paddle::GradientMachine::create()
    @           0xc3b80c  paddle::TrainerInternal::init()
    @           0xc3526b  paddle::Trainer::init()
    @           0xa4ea26  main
    @     0x7ff5bd545bd5  __libc_start_main
    @           0xa4df29  (unknown)
local.sh: line 15: 18161 Aborted                 (core dumped) PYTHONPATH=./:../../../python ../../../../build/paddle/trainer/paddle_trainer --use_gpu=0 --config=./sparse_trainer_config.py --saving_period=1 --test_period=0 --num_passes=4 --dot_period=2 --log_period=20 --trainer_count=2 --saving_period_by_batches=5000 --local=1

Because a matrix of type (matType == MAT_SPARSE_ROW_IDS) must be initialized (why the single-GPU path does not have this matrix is still unclear)?

@backyes
Contributor Author

backyes commented Dec 26, 2016

Update:

  • Enhancement:
    The sparse update configuration needs a check: it only enables the weight update of the first hidden layer; its effect on the other hidden layers is unknown.
@luotao1
Contributor

luotao1 commented Feb 1, 2019

Thanks for contributing to PaddlePaddle! Since V1/V2 will not be maintained anymore, and the related code has been deleted from the develop branch as well, we are closing this PR. Welcome to contribute to Fluid, the latest version of PaddlePaddle.

@luotao1 luotao1 closed this Feb 1, 2019
zhhsplendid pushed a commit to zhhsplendid/Paddle that referenced this pull request Sep 25, 2019
wangxicoding pushed a commit to wangxicoding/Paddle that referenced this pull request Dec 9, 2021