From the course: NVIDIA Certified Associate AI Infrastructure and Operations (NCA-AIIO) Cert Prep


NVIDIA collective communications library (NCCL)

Let's talk about the NVIDIA Collective Communications Library. There are scenarios in model training or model inference where multiple GPUs have to communicate with each other. Now, these GPUs may exist in the same system, or they may be outside of that system. So should they use NVLink, should they use NVSwitch, should they use RDMA for their communication? Rather than taking care of all of this programmatically ourselves, we rely on this library, which takes care of all the communication for your GPUs. So let's focus on the NVIDIA Collective Communications Library, NCCL. NCCL is a multi-GPU communication library which provides abstractions and optimized patterns for communication between many GPUs. Rather than me initiating a connection, communicating through a particular mode of transport, and then closing the connection, I let NCCL detect the underlying architecture and then perform optimized communication on top of it. So here we have host 1. Host 1 GPUs can…
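To make this concrete, here is a minimal sketch of how a single process can use NCCL to run an all-reduce across all GPUs in one host. It follows the pattern described above: the application calls NCCL collectives, and NCCL chooses the best transport (NVLink, NVSwitch, PCIe, or the network) underneath. This is an illustrative sketch, not code from the course; it assumes CUDA and NCCL are installed, omits error checking, and requires at least one GPU to run.

```c
/* Sketch: single-process, multi-GPU all-reduce with NCCL.
   Assumes CUDA + NCCL are installed; compile roughly as:
     nvcc allreduce_sketch.c -lnccl        (hypothetical filename) */
#include <cuda_runtime.h>
#include <nccl.h>

int main(void) {
    int nDev = 0;
    cudaGetDeviceCount(&nDev);          /* use every visible GPU */
    if (nDev > 8) nDev = 8;             /* keep the sketch's fixed arrays safe */

    ncclComm_t   comms[8];
    cudaStream_t streams[8];
    float *sendbuf[8], *recvbuf[8];
    const size_t count = 1024;          /* elements per GPU, arbitrary */

    /* One communicator per local GPU; passing NULL uses devices 0..nDev-1.
       NCCL discovers the topology and picks the transport itself. */
    ncclCommInitAll(comms, nDev, NULL);

    for (int i = 0; i < nDev; i++) {
        cudaSetDevice(i);
        cudaMalloc((void **)&sendbuf[i], count * sizeof(float));
        cudaMalloc((void **)&recvbuf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    /* Group the per-GPU calls so NCCL launches them as one collective. */
    ncclGroupStart();
    for (int i = 0; i < nDev; i++)
        ncclAllReduce(sendbuf[i], recvbuf[i], count,
                      ncclFloat, ncclSum, comms[i], streams[i]);
    ncclGroupEnd();

    /* Wait for completion, then clean up. */
    for (int i = 0; i < nDev; i++) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        cudaFree(sendbuf[i]);
        cudaFree(recvbuf[i]);
        cudaStreamDestroy(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```

Note that the application never says "use NVLink" or "use RDMA": the same `ncclAllReduce` call works whether the GPUs are connected over NVLink within a host or over the network between hosts (in the multi-host case you would use `ncclCommInitRank` with a shared `ncclUniqueId` instead of `ncclCommInitAll`).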