
Trainer send term signal #11220

Merged
typhoonzero merged 13 commits into PaddlePaddle:develop from typhoonzero:trainer_send_term_signal
Jun 12, 2018

@typhoonzero
Contributor

Fix #11077

Call exe.executor.complete() to tell the pserver to mark the current trainer as finished.
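
To illustrate why marking a trainer as finished matters, here is a minimal sketch of a count-based batch barrier. It is a single-threaded model written for this note only: the class `CountingBarrier` and its methods are hypothetical names, not the Paddle RPC server API, and real code would need locking and condition variables. It shows that once a finished trainer is removed from the expected client count, the remaining trainers are no longer blocked waiting for it.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, single-threaded model of a count-based batch barrier.
// Names are illustrative and are not taken from the Paddle source.
class CountingBarrier {
 public:
  explicit CountingBarrier(std::size_t client_num) : client_num_(client_num) {}

  // A trainer reports that it has sent its data for the current batch.
  void Arrive() { ++arrived_; }

  // A trainer reports that it has finished training and will not return.
  void Complete() {
    assert(client_num_ > 0);  // cannot finish more trainers than exist
    --client_num_;
  }

  // The barrier opens once every still-active trainer has arrived.
  bool Ready() const { return arrived_ >= client_num_; }

  // Clear the arrival count before the next batch.
  void Reset() { arrived_ = 0; }

  std::size_t client_num() const { return client_num_; }

 private:
  std::size_t client_num_;   // trainers the barrier still waits for
  std::size_t arrived_ = 0;  // arrivals seen in the current batch
};
```

With two trainers, one `Arrive()` leaves the barrier closed. If the other trainer then calls `Complete()`, the barrier opens with the single arrival.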

gongweibao previously approved these changes Jun 11, 2018
Contributor

@gongweibao gongweibao left a comment

LGTM

    rpc_server_->IncreaseBatchBarrier(kRequestSend);
  } else if (varname == COMPLETE_MESSAGE) {
    VLOG(3) << "sync: recv complete message";
    rpc_server_->DecreaseClientNum();

Smart!

@gongweibao
Contributor

Relying only on a count may not be stable. Do we need to make the barriers tolerant of a client sending multiple COMPLETE_MESSAGE, BATCH_BARRIER_MESSAGE, and so on within one barrier?

@typhoonzero
Contributor Author

@gongweibao You are right; we may need to add the trainer id to the request if we want to make all RPC calls retriable.
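
One way to picture the trainer-id idea is the sketch below. It is an assumption about how such a change could look, not code from this PR: `IdBarrier` and its methods are hypothetical names, and the model is single-threaded. Tracking trainers by id in sets rather than by a bare count makes a repeated arrive or complete message from the same trainer a no-op, which is what a retried RPC would need.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical id-based barrier: duplicate messages from one trainer
// do not change the state, so a retried RPC is harmless.
class IdBarrier {
 public:
  explicit IdBarrier(std::size_t client_num) {
    assert(client_num > 0);  // this sketch expects at least one trainer
    for (std::size_t id = 0; id < client_num; ++id) active_.insert(id);
  }

  // Record an arrival; ignored for unknown or already finished trainers.
  void Arrive(std::size_t trainer_id) {
    if (active_.count(trainer_id) != 0) arrived_.insert(trainer_id);
  }

  // Mark a trainer as finished; repeating the call has no further effect.
  void Complete(std::size_t trainer_id) {
    active_.erase(trainer_id);
    arrived_.erase(trainer_id);
  }

  // Open once every still-active trainer has arrived.
  bool Ready() const { return arrived_.size() == active_.size(); }

  // Clear arrivals before the next batch.
  void Reset() { arrived_.clear(); }

  std::size_t active_count() const { return active_.size(); }

 private:
  std::set<std::size_t> active_;   // trainers still training
  std::set<std::size_t> arrived_;  // subset of active_ seen this batch
};
```

Sending the same message twice, for example `Arrive(0)` followed by `Arrive(0)`, leaves the barrier in the same state as sending it once, whereas a plain counter would be incremented twice.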

Contributor

@gongweibao gongweibao left a comment

LGTM

@typhoonzero typhoonzero merged commit 34865f2 into PaddlePaddle:develop Jun 12, 2018
@typhoonzero typhoonzero deleted the trainer_send_term_signal branch June 12, 2018 08:09