Conversation
c4d0515 to e0156c1
999ea6c to ef027b3
paddle/fluid/framework/tensor.h (outdated)
Curious why we need to add use_pinned_ to Tensor. If we just want to use pinned memory for copying, wouldn't putting this in the CopyTensor implementation be enough?
> just put this in the CopyTensor implement
Do you mean, for CPU->GPU, to copy the data from CPU to pinned memory first and then copy from pinned memory to GPU?
Of course not. I was thinking about the GPU->CPU copy. For the CPU->GPU case, it seems this is the only way for now.
Why use pinned memory only as a staging area instead of using it directly? CPU computations can access data in pinned memory directly.
In that case, can we make use_pinned_ a global GFLAG, so that all host allocations use pinned memory? Then we could test the overall performance boost.
Because memory copies between GPU and CPU are asynchronous, in the GPU->CPU case we must ensure the copy has completed before reading from pinned memory, which requires an explicit synchronization. So using pinned memory directly is a little complex.
I plan to put only the input data into pinned memory and test the overall performance.
Is this still WIP, or can I review and merge it?
ef027b3 to eaa90d3
paddle/fluid/memory/memory.cc (outdated)
void* Alloc<platform::CUDAPlace>(platform::CUDAPlace place, size_t size,
                                 bool use_pinned) {
  void* ptr;
  if (use_pinned) {
Alloc on CUDAPlace with use_pinned=false returns a pointer on the device, but calling it with use_pinned=true returns a pointer on the host; this is a little confusing.
Yes, maybe we should add a new place type (CUDAPinnedPlace).
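A dedicated place type could mirror the style of the existing place structs. This is a hypothetical sketch only; CUDAPinnedPlace is merely proposed in this thread, and the member names here are assumptions:

```cpp
// Hypothetical sketch of a place type for pinned host memory,
// modeled on the style of platform::CPUPlace / platform::CUDAPlace.
// Pinned memory lives on the host, so no device id is needed.
struct CUDAPinnedPlace {
  CUDAPinnedPlace() {}
  inline bool operator==(const CUDAPinnedPlace&) const { return true; }
  inline bool operator!=(const CUDAPinnedPlace&) const { return false; }
};
```

A separate place type would let Alloc dispatch on the place alone, avoiding the confusing use_pinned flag on CUDAPlace.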
paddle/fluid/framework/tensor_impl.h (outdated)
 #else
     holder_.reset(new PlaceholderImpl<platform::CUDAPlace>(
-        boost::get<platform::CUDAPlace>(place), size, type));
+        boost::get<platform::CUDAPlace>(place), size, type, use_pinned));
The holder_ pointer will be on the host here, which may cause errors.
The host memory is page-locked and therefore accessible to the device.
Refer: http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1gab84100ae1fa1b12eaca660207ef585b
Thank you for this information!
… feature/add_pinned_memory
ba1178e to 9e99446
typhoonzero left a comment
LGTM++, we can add the new place type in the next PR.
WIP
The current CUDA Runtime documentation states, under asynchronous Memcpy behavior:
- For transfers from device memory to pageable host memory, the function will return only once the copy has completed.