Boosting AI Training with NVIDIA Collective Communication Library 2.12
Summary
The NVIDIA Collective Communication Library (NCCL) 2.12 release brings significant improvements to all2all collective communication performance, which is crucial for distributed AI training workloads such as recommender systems and natural language processing. This article covers the new features and enhancements in NCCL 2.12, in particular PXN (PCI × NVLink), which combines NVLink and PCI communications to keep inter-node traffic on its local rail and reduce load on second-tier spine switches.
Introduction
Distributed AI training relies heavily on collective communication primitives such as all-gather, all-reduce, broadcast, reduce, reduce-scatter, and all-to-all (all2all). The all2all pattern is particularly challenging because every rank exchanges data with every other rank, producing a large number of messages and a high risk of network congestion. The NVIDIA Collective Communication Library (NCCL) is a Magnum IO library designed to accelerate these collective operations on GPUs.
What is NCCL?
NCCL is a library of multi-GPU collective communication primitives that are topology-aware and can be easily integrated into applications. It supports collective operations including all-gather, all-reduce, broadcast, reduce, and reduce-scatter, as well as point-to-point send and receive, which serve as building blocks for all2all. NCCL is optimized for high bandwidth and low latency over PCIe, NVLink, Ethernet, and InfiniBand interconnects.
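As a refresher on how NCCL is used in practice, here is a minimal single-process sketch that runs an all-reduce across all visible GPUs. It uses only the public NCCL API (ncclCommInitAll, ncclAllReduce, and the group calls); error checking is omitted for brevity.

```c
// Minimal sketch: single-process all-reduce across all visible GPUs.
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <nccl.h>

int main(void) {
  int ndev = 0;
  cudaGetDeviceCount(&ndev);

  ncclComm_t *comms = malloc(ndev * sizeof(ncclComm_t));
  float **sendbuf = malloc(ndev * sizeof(float *));
  float **recvbuf = malloc(ndev * sizeof(float *));
  cudaStream_t *streams = malloc(ndev * sizeof(cudaStream_t));
  const size_t count = 1 << 20;  // 1M floats per GPU

  // Allocate buffers and a stream on each device.
  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(i);
    cudaMalloc((void **)&sendbuf[i], count * sizeof(float));
    cudaMalloc((void **)&recvbuf[i], count * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }

  // Create one communicator per device within this process.
  ncclCommInitAll(comms, ndev, NULL);

  // Launch the collective on every device inside a group call.
  ncclGroupStart();
  for (int i = 0; i < ndev; ++i)
    ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  ncclGroupEnd();

  // Wait for completion and clean up.
  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
    cudaFree(sendbuf[i]);
    cudaFree(recvbuf[i]);
    ncclCommDestroy(comms[i]);
  }
  free(comms); free(sendbuf); free(recvbuf); free(streams);
  return 0;
}
```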
The New PXN Feature
The NCCL 2.12 release introduces a new feature called PXN, short for PCI × NVLink. PXN enables a GPU to communicate with a network interface card (NIC) on its node by first hopping over NVLink to a GPU that shares a PCI switch with that NIC, rather than routing through the CPU over QPI or other inter-CPU protocols, which cannot deliver full bandwidth. Because traffic can exit through a NIC on the same rail as its destination, flows stay on their leaf switches instead of crossing second-tier spine switches, which optimizes network traffic.
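PXN is enabled by default in 2.12, but it can be toggled for A/B comparisons via environment variables, which NCCL reads when the first communicator is created. The sketch below shows one way to do that from application code; NCCL_PXN_DISABLE and NCCL_DEBUG are documented NCCL variables, while the helper function itself is a hypothetical convenience.

```c
// Sketch for A/B-testing PXN: set NCCL environment variables before the
// first communicator is created, since NCCL reads them at init time.
// NCCL_PXN_DISABLE and NCCL_DEBUG are real NCCL variables; the helper
// function and its call site are application-specific assumptions.
#include <stdlib.h>
#include <nccl.h>

void configure_nccl_for_pxn_test(int disable_pxn) {
  // "1" disables PXN so a run can be compared against the default ("0").
  setenv("NCCL_PXN_DISABLE", disable_pxn ? "1" : "0", 1);

  // INFO-level logs report the transports and channels NCCL selects,
  // which helps confirm whether PXN paths are actually in use.
  setenv("NCCL_DEBUG", "INFO", 1);

  // ... create NCCL communicators after this point, e.g. with
  // ncclCommInitRank() or ncclCommInitAll() ...
}
```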
How PXN Works
PXN works by moving data from all GPUs on a node onto a single GPU for a given destination. This allows the network layer to aggregate messages through a new multireceive function: the receiver posts one receive covering multiple buffers, and the sender-side CPU proxy ships all the corresponding messages as one as soon as they are ready. For example, if a GPU performing an all2all operation needs to receive data from all eight GPUs of a remote node, NCCL posts a single multireceive with eight buffers and sizes; on the sender side, the network layer waits until all eight sends are ready and then sends the eight messages at once, significantly improving the effective message rate.
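At the application level, none of this machinery is visible: an all2all is still expressed as grouped point-to-point calls, and NCCL handles the PXN forwarding and aggregation internally. The sketch below shows the standard grouped send/receive pattern for all2all from the NCCL documentation; the wrapper function, its buffers, communicator, and stream are assumptions supplied by the caller.

```c
// Sketch of the all2all pattern PXN accelerates: every rank sends one chunk
// to, and receives one chunk from, every peer. Grouping the point-to-point
// calls lets NCCL fuse and aggregate them internally (the multireceive path
// described above). Buffers, comm, and stream are set up elsewhere.
#include <cuda_runtime.h>
#include <nccl.h>

ncclResult_t all2all(const float *sendbuf, float *recvbuf, size_t chunk,
                     int nranks, ncclComm_t comm, cudaStream_t stream) {
  ncclGroupStart();
  for (int peer = 0; peer < nranks; ++peer) {
    // Send the peer-th chunk of our buffer; receive into the peer-th slot.
    ncclSend(sendbuf + peer * chunk, chunk, ncclFloat, peer, comm, stream);
    ncclRecv(recvbuf + peer * chunk, chunk, ncclFloat, peer, comm, stream);
  }
  return ncclGroupEnd();
}
```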
Benefits of PXN
The PXN feature brings several benefits to all2all collective communication:
- Improved Performance: By aggregating messages and keeping flows on their local rail, PXN substantially improves all2all bandwidth and message rate.
- Flexibility: PXN gives more flexibility in using model parallelism, making it easier to split models across GPUs.
- Reduced Congestion: By optimizing network traffic, PXN reduces the risk of network congestion and hotspots.
Performance Comparison
| Metric | Without PXN | With PXN |
| --- | --- | --- |
| all2all latency | Higher | Lower |
| Message rate | Lower | Higher |
| Traffic on second-tier spine switches | Higher | Lower |
Further Considerations
For readers interested in exploring collective communication further, recent studies have proposed novel topologies and algorithms for optimizing all-to-all collectives on supercomputer-scale direct-connect interconnects. These studies underline how central efficient collective communication is to both machine learning and high-performance computing workloads.
Technical Details
For a deeper dive into the technical aspects of NCCL and its features, the NCCL release notes document the improvements and fixes in each version. Understanding these details, and verifying which library version is actually loaded at run time, helps in leveraging NCCL's capabilities for specific use cases.
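Since PXN first shipped in 2.12, a quick run-time check of the linked library version can save debugging time. This small sketch uses the real ncclGetVersion call; the 21200 threshold follows NCCL's version encoding (major*10000 + minor*100 + patch for releases from 2.9 onward), so treat it as an assumption to verify against your NCCL headers.

```c
// Sketch: confirm at run time that the linked NCCL is at least 2.12,
// the first release with PXN. For NCCL 2.9+, versions are encoded as
// major*10000 + minor*100 + patch, so 2.12.0 is 21200.
#include <stdio.h>
#include <nccl.h>

int main(void) {
  int version = 0;
  ncclGetVersion(&version);
  printf("NCCL version code: %d\n", version);
  if (version < 21200)
    printf("PXN requires NCCL >= 2.12; consider upgrading.\n");
  return 0;
}
```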
Conclusion
The NCCL 2.12 release, with its new PXN feature, delivers significant improvements to all2all collective communication performance. By combining NVLink and PCI communications, PXN keeps traffic on its local rail and off second-tier spine switches while aggregating messages for a higher message rate, translating into better performance and more flexibility for distributed AI training workloads. For teams looking to boost AI training efficiency, upgrading to NCCL 2.12 is a step in the right direction.