Design doc for parallel_do.md (#8425)
> fc_grad, allreduce(places, scopes, w1_grad),
> fc_grad, allreduce(places, scopes, w2_grad)
> }
> block3 {
I think it's better to indicate each block's parent.
> .AsDuplicable();
> AddInput(kPlaces, "Devices used for parallel processing");
> AddOutput(kOutputs, "Outputs needed to be merged from different devices").AsDuplicable();
> AddOutput(kParallelScopes,
kParallelScopes seems to indicate that there are multiple scopes, but the description says Container, which is a single container:
- does container mean scope?
- is there a single scope or multiple scopes?
- yes, "container" means scope
- there is one scope for each device
Maybe change "container" to "scope" and make "one scope for each device" clear? :)
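To make "one scope for each device" concrete, here is a minimal sketch in plain Python. The `Scope` class is a stand-in for the framework's scope type, not the actual Paddle API; the place names are likewise illustrative:

```python
# Illustrative only: Scope is a stand-in for the framework's scope
# type, not the real Paddle API.
class Scope:
    def __init__(self):
        self.vars = {}  # variable name -> value

places = ["gpu:0", "gpu:1", "gpu:2"]

# parallel_do keeps one child scope per device, so each device's
# copy of the inputs and parameters lives in its own namespace.
parallel_scopes = {place: Scope() for place in places}

for place, scope in parallel_scopes.items():
    scope.vars["w1"] = "copy of w1 on " + place

assert len(parallel_scopes) == len(places)
```

Under this reading, kParallelScopes is the collection of these per-device scopes rather than a single container.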
> ```
> In the forward pass
> | Split input onto different devices
> | Copy parameter to onto different devices
It seems that "Copy parameter to onto different devices" is only done the first time the parallel_do OP runs. Maybe we need to make that clear.
The current version does this at every iteration.
> | Merge output from different devices
>
> In the backward pass
> | Split output@grad onto different devices
Is it split or duplicate?
> | Split output@grad onto different devices
> |||| Compute backward pass in parallel
> | accumulate param@grad from different devices to the first device
> | Merge input@grad from different devices
Is it input@grad or param@grad?
Another step, "Copy param@grad to the place of parallel_do_op", should be added here.
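The backward-pass steps under discussion (split output@grad onto devices, compute the backward pass in parallel, accumulate param@grad on the first device, merge input@grad) can be sketched in plain NumPy. The linear layer `y = x @ w` here is a hypothetical example, not the doc's actual network:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 3))          # shared parameter
x = rng.standard_normal((8, 4))          # minibatch input
out_grad = rng.standard_normal((8, 3))   # output@grad from upstream

n_dev = 2
x_parts = np.split(x, n_dev)             # forward: split input onto devices
g_parts = np.split(out_grad, n_dev)      # backward: split output@grad the same way

# per-device backward for y = x @ w
w_grads = [xp.T @ gp for xp, gp in zip(x_parts, g_parts)]
x_grads = [gp @ w.T for gp in g_parts]

# accumulate param@grad from different devices on the "first device";
# it would then be copied to the place of parallel_do_op
w_grad = sum(w_grads)

# merge input@grad from different devices
x_grad = np.concatenate(x_grads)

# sanity check: identical to single-device backward
assert np.allclose(w_grad, x.T @ out_grad)
assert np.allclose(x_grad, out_grad @ w.T)
```

The check at the end shows why param@grad is accumulated (summed) while input@grad is merged (concatenated): the parameter is shared across devices, but each device owns a distinct slice of the batch.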
> # get embedding feature on CPU
> feature = some_cpu_only_op(data)
>
> gpu_places = get_place(use_gpu=True)
Can the Python API specify 5 parallel CPU threads when there is no GPU?
doc/design/parallel_do.md
Outdated
> with pd.do():
>     read_input(feature)
>     prediction = my_net(feature)
>     write_output(activation)
write_output(activation) or write_output(prediction)?
typo. Thanks for pointing it out.
>     read_input(feature)
>     prediction = my_net(feature)
>     write_output(activation)
> prediction = pd()
Does the Python API support multiple outputs? If so can you provide an example?
doc/design/parallel_do.md
Outdated
> ```python
> pd = ParallelDo(gpu_places)
> with pd.do():
>     feature = pre_fetch(gpu_places)
Sorry, I don't understand how pre_fetch will work here: since pre_fetch is inside the child block of parallel_do, it will not run until parallel_do runs. Isn't that too late for prefetching?
This op hasn't been implemented yet, but there should be a background thread adding data to the fetching queue before this OP is called.
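A minimal sketch of that idea in plain Python threading: a background producer fills a bounded queue ahead of time, so the consumer (here standing in for pre_fetch inside the parallel_do block) only pops already-loaded batches. `load_batch` is hypothetical, not the actual op:

```python
import queue
import threading

def load_batch(i):
    # stand-in for the real data-loading / I/O work
    return [i] * 4

prefetch_queue = queue.Queue(maxsize=8)

def producer(n_batches):
    # runs in the background, so batches are ready
    # before the parallel_do block asks for them
    for i in range(n_batches):
        prefetch_queue.put(load_batch(i))

t = threading.Thread(target=producer, args=(3,), daemon=True)
t.start()

# inside the parallel_do block, pre_fetch would just pop an
# already-loaded batch instead of doing the I/O itself
batches = [prefetch_queue.get() for _ in range(3)]
t.join()
```

With this shape, the fact that pre_fetch is called inside the child block does not delay loading; the call only synchronizes with work the background thread started earlier.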
> write_output(activation)
> ```
>
> ### forward: Copy parameter to onto different devices
Is "Copy parameter onto different devices" a performance improvement? I agree that this is a more graceful approach, but won't "Copy parameter onto different devices" run only once, so the performance cost may be negligible?
It looks like the body of this section contains other optimizations besides "Copy parameter onto different devices"; maybe it needs a better title?
Maybe I have this question because I did not fully understand it.
The current implementation of backward only supports updating the gradient in one place, so we need to copy the updated parameters at every iteration.
> }
> ```
>
> ## Proformance Imporvement
Just a minor typo here. Proformance -> Performance
> ```
> In the forward pass
> | Split input onto different devices
> | Copy parameter to onto different devices
You can drop the "to": "Copy parameter onto different devices".