Mpi enabled by seiriosPlus · Pull Request #9490 · PaddlePaddle/Paddle

seiriosPlus · 2018-03-29T04:24:43Z

design doc about #9405

seiriosPlus · 2018-03-29T10:16:51Z

@Yancey1989 @typhoonzero I have written a design doc for MPI-enabled PaddlePaddle; thanks in advance for the review.

typhoonzero · 2018-03-30T01:51:54Z

Please fix the style check first.

typhoonzero · 2018-04-11T08:04:30Z

doc/fluid/design/dist_train/mpi_enabled_design.md

@@ -0,0 +1,33 @@
+#MPI-enabled PaddlePaddle Design doc


Need a space after "#"

typhoonzero · 2018-04-11T08:18:04Z

doc/fluid/design/dist_train/mpi_enabled_design.md

+#MPI-enabled PaddlePaddle Design doc
+
+# Background
+Now, PaddlePaddle Fluid with Distribution has relatively large network bottleneck, We want to use RDMA and GPUDriect to improve and solve it, so we enabled the features to PaddlePaddle with the help of MPI.


When we do distributed multi-GPU training, the communication overhead between servers becomes the major bottleneck, for the following reasons:

Data must be copied at least once from GPU to CPU memory before it is ready to transfer. On the pserver side, copying data from CPU back to GPU introduces more overhead.

GPU->CPU data transfer is 10 times slower than data transfer between GPUs or between PCIe devices.

TCP connections cannot make full use of RDMA 100Gb devices.

typhoonzero · 2018-04-11T08:18:21Z

doc/fluid/design/dist_train/mpi_enabled_design.md

+# Background
+Now, PaddlePaddle Fluid with Distribution has relatively large network bottleneck, We want to use RDMA and GPUDriect to improve and solve it, so we enabled the features to PaddlePaddle with the help of MPI.
+
+We will introduce Open MPI API to PaddlePaddle, which can bring two benefits to PaddlePaddle:


We will use

Open MPI => OpenMPI

typhoonzero · 2018-04-11T08:19:30Z

doc/fluid/design/dist_train/mpi_enabled_design.md

+We will introduce Open MPI API to PaddlePaddle, which can bring two benefits to PaddlePaddle:
+1. Enable RDMA with PaddlePaddle, which bring high-performance low latency networks.
+2. Enable GPUDriect with PaddlePaddle, which bring the highest throughput and lowest latency GPU read and write.
+


Need details of the design.

typhoonzero · 2018-04-11T08:20:47Z

doc/fluid/design/dist_train/mpi_enabled_design.md

+### mpi_module
+We will build a new module to package MPI send and receive process. MPI send and recvice are defferent to gRPC, the MPI [recvice](https://www.open-mpi.org/doc/v1.8/man3/MPI_Irecv.3.php) must know receive buffer size and receive buffer element. For this reason, We have to make conmunications twice, the first one is to send metadata about gradient through gRPC, the second one is the real conmunications through MPI which send gradient data to mpi_listenandserve_op.
+The detail flow is below:
+![](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/design/dist_train/src/mpi_module.png)


The picture is not shown.
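
For readers following the hunk without the figure, the two-step exchange it describes (metadata first, then the payload into a buffer sized from that metadata) can be sketched roughly as follows. This is only an illustration, not the PR's implementation: the two queues are hypothetical stand-ins for the gRPC metadata call and the MPI send/receive pair, and the names `send_gradient`, `recv_gradient`, and the metadata fields are assumptions made for the example.

```python
from array import array
from queue import Queue

# Hypothetical stand-ins: in the design, metadata travels over gRPC and the
# payload over MPI. Two in-process queues play those roles here.
control_channel = Queue()  # stands in for the gRPC metadata call
data_channel = Queue()     # stands in for the MPI send/receive pair


def send_gradient(name, values):
    """Step 1: announce name, element type and count. Step 2: send raw bytes."""
    payload = array("f", values)
    control_channel.put({"name": name, "typecode": "f", "count": len(payload)})
    data_channel.put(payload.tobytes())


def recv_gradient():
    """Size the receive buffer from the metadata, then fill it with the payload."""
    meta = control_channel.get()
    itemsize = array(meta["typecode"]).itemsize
    # Preallocate a zero-filled buffer with exactly the announced element count,
    # mirroring the requirement that the receiver knows the size up front.
    buf = array(meta["typecode"], bytes(meta["count"] * itemsize))
    raw = data_channel.get()
    if len(raw) != len(buf) * itemsize:
        raise ValueError("payload size does not match the announced metadata")
    memoryview(buf).cast("B")[:] = raw  # copy the bytes into the buffer in place
    return meta["name"], buf


if __name__ == "__main__":
    send_gradient("w@GRAD", [0.5, -1.25, 2.0])
    print(recv_gradient())
```

The size check in `recv_gradient` shows why the metadata step is needed: without it the receiver could not allocate the buffer before the payload arrives.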

typhoonzero · 2018-04-11T08:29:55Z

doc/fluid/design/dist_train/mpi_enabled_design.md

+Launch the script using the ```mpirun``` launcher, For example: ```mpirun -np 3 -hosts node1,node2,node3 python train.py```. By doing this, We can number the actors (trainer/pserver/master) with o .. (n-1). The node's number is the Rank of the calling process in a group of comm (integer),  The MPI processes identify each other using a Rank ID. We have to create a mapping between PaddlePaddle's actors and their Rank ID so that we can communicate with the correct destinations when using MPI operations.
+    **We have to store the Rank ID and the mapping in global variables.**
+
+## New OP/MODULE


Just list the things that need to be changed, like:

mpi_send_op

mpi_serv_op

modify transpiler to support using MPI or not

compile arg to enable MPI support

...

then discuss the details.
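
As one concrete input to that discussion, the rank-to-actor mapping mentioned in the quoted hunk could look roughly like the sketch below. It assumes a purely hypothetical convention (the lowest ranks are pservers, the rest are trainers) that the design doc does not specify, and the helper names are invented for illustration. In a real run the rank would come from the MPI runtime, for example `MPI.COMM_WORLD.Get_rank()` in mpi4py, rather than being passed in by hand.

```python
def build_rank_map(world_size, num_pservers):
    """Map each rank in 0..world_size-1 to a (role, index) pair.

    Assumed convention (not from the design doc): ranks below num_pservers
    are pservers, all remaining ranks are trainers.
    """
    if not 0 < num_pservers < world_size:
        raise ValueError("need at least one pserver and one trainer")
    rank_map = {}
    for rank in range(world_size):
        if rank < num_pservers:
            rank_map[rank] = ("pserver", rank)
        else:
            rank_map[rank] = ("trainer", rank - num_pservers)
    return rank_map


def rank_of(rank_map, role, index):
    """Reverse lookup: find the rank to address for a given actor."""
    for rank, actor in rank_map.items():
        if actor == (role, index):
            return rank
    raise KeyError((role, index))


if __name__ == "__main__":
    # Corresponds to a launch such as `mpirun -np 3 ...` with one pserver.
    mapping = build_rank_map(world_size=3, num_pservers=1)
    print(mapping)
    print(rank_of(mapping, "trainer", 1))
```

Keeping the mapping in one dictionary makes it easy to store as the global state the hunk calls for, and gives both send and serve operators a single place to resolve destinations.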

typhoonzero · 2018-04-17T06:03:29Z

paddle/fluid/operators/detail/mpi_client.cpp

@@ -0,0 +1,29 @@
+


Separate the implementation and the design into two PRs.

I have deleted the implementation for now and will rebuild it in another PR.

typhoonzero

LGTM


seiriosPlus added 4 commits March 27, 2018 22:07

mpi enabled design doc

3b39815

mpi sever code

b1adcd4

mpi tools

94ce020

mpi enabled design doc

9d256dd

seiriosPlus requested review from Yancey0623 and typhoonzero March 29, 2018 10:15

send mpi and rpc framework

900b3e9

typhoonzero reviewed Apr 11, 2018


structure optimized, more detail

5bf671b

typhoonzero reviewed Apr 17, 2018


Separate the implement to another PR later

10669f1

typhoonzero approved these changes Apr 17, 2018


Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

d2ba05a

… mpi_enabled

seiriosPlus merged commit 6f3aa72 into PaddlePaddle:develop Apr 17, 2018

seiriosPlus deleted the mpi_enabled branch April 25, 2018 06:53

Mpi enabled #9490
seiriosPlus merged 8 commits into PaddlePaddle:develop from seiriosPlus:mpi_enabled
