Summary
The NVIDIA Collective Communications Library (NCCL) has released version 2.23, which includes significant improvements for optimizing inter-GPU and multinode communication. This update is crucial for efficient parallel computing in AI and high-performance computing (HPC) applications. Key enhancements include the new Parallel Aggregated Trees (PAT) algorithm for AllGather and ReduceScatter operations, accelerated initialization, intranode user buffer registration, and a new profiler plugin API. These features aim to enhance the scalability and performance of NCCL, particularly in large-scale AI and HPC environments.
New Scaling Algorithm: Parallel Aggregated Trees (PAT)
Overview of PAT
The PAT algorithm is a variation of the Bruck algorithm, designed to achieve logarithmic scaling for AllGather and ReduceScatter operations. This is particularly beneficial for small to medium message sizes, where the improvement increases as the workload scales. PAT executes a binomial tree shifted for each rank, offering an advantage over similar algorithms like recursive doubling by working on any number of ranks, not just powers of two.
Key Features of PAT
- Logarithmic Scaling: PAT features a logarithmic number of network steps for small sizes at scale, progressively increasing the number of network transfers as sizes increase to keep buffering needs minimal.
- Flexibility: Unlike recursive doubling, PAT works on any number of ranks, making it versatile for various configurations.
- Performance: Small to medium message sizes perform better with PAT, with improvements increasing as the workload scales.
Use Cases for PAT
PAT is particularly important for large language model (LLM) training, where pipeline parallelism and tensor parallelism run in dimensions orthogonal to data parallelism. Because the tensor parallelism dimension is usually mapped onto the intranode NVLink connectivity, the other dimensions often have only one GPU per node; their AllGather and ReduceScatter operations therefore run over the network at scale with small to medium message sizes, which is exactly the regime PAT accelerates.
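From the application's point of view nothing changes: NCCL selects the algorithm (ring, tree, PAT, ...) internally, so an ordinary AllGather call benefits automatically when PAT applies (typically for inter-node communication at scale). The following is a minimal single-process sketch of such a call; the device count, message size, and error-handling macro are illustrative and not part of the release.

```c
// Minimal single-process AllGather sketch. NCCL 2.23 may serve this with PAT
// for small-to-medium messages at scale; the call itself is unchanged.
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <nccl.h>

#define CHECK(cmd) do {                                      \
  ncclResult_t r = (cmd);                                    \
  if (r != ncclSuccess) {                                    \
    fprintf(stderr, "NCCL error %s:%d '%s'\n",               \
            __FILE__, __LINE__, ncclGetErrorString(r));      \
    exit(1);                                                 \
  }                                                          \
} while (0)

int main(void) {
  int ndev = 0;
  cudaGetDeviceCount(&ndev);

  ncclComm_t*   comms   = malloc(ndev * sizeof(ncclComm_t));
  cudaStream_t* streams = malloc(ndev * sizeof(cudaStream_t));
  float**       sendbuf = malloc(ndev * sizeof(float*));
  float**       recvbuf = malloc(ndev * sizeof(float*));

  const size_t count = 1024;  // small message: the regime where PAT helps most

  // One communicator per local GPU (single-process initialization).
  CHECK(ncclCommInitAll(comms, ndev, NULL));

  for (int i = 0; i < ndev; i++) {
    cudaSetDevice(i);
    cudaStreamCreate(&streams[i]);
    cudaMalloc((void**)&sendbuf[i], count * sizeof(float));
    cudaMalloc((void**)&recvbuf[i], ndev * count * sizeof(float));
  }

  // AllGather: each rank contributes `count` floats; every rank receives all
  // contributions. NCCL picks the algorithm internally.
  CHECK(ncclGroupStart());
  for (int i = 0; i < ndev; i++) {
    CHECK(ncclAllGather(sendbuf[i], recvbuf[i], count, ncclFloat,
                        comms[i], streams[i]));
  }
  CHECK(ncclGroupEnd());

  for (int i = 0; i < ndev; i++) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
    cudaFree(sendbuf[i]); cudaFree(recvbuf[i]);
    ncclCommDestroy(comms[i]);
  }
  free(comms); free(streams); free(sendbuf); free(recvbuf);
  return 0;
}
```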
Accelerated Initialization
New Initialization API
The ncclCommInitRankScalable API is a new initialization function that lets multiple unique IDs be used during communicator creation. This avoids all-to-one communication patterns during initialization, making initialization performance more scalable; a minimal usage sketch follows the feature list below.
Key Features of Accelerated Initialization
- Multiple Unique IDs: Users can provide more than one unique ID to be used during the bootstrap, spreading the load across multiple IDs for constant bootstrap time at scale.
- In-Band Networking: The fast network (IB/RoCE/…) can be used for out-of-band communication to speed up the two linear steps of initialization: bootstrap and allgather.
- Performance Tuning: Elimination of some bootstrap collectives and performance tuning in the bootstrap step improve overall initialization performance.
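The sketch below shows one way to feed several unique IDs to ncclCommInitRankScalable, assuming an MPI launcher for the out-of-band exchange. The one-ID-per-64-ranks ratio and the local rank-to-GPU mapping are illustrative choices, not NCCL requirements, and the function signature follows the NCCL 2.23 documentation (check nccl.h for your installed version).

```c
// Hedged sketch: scalable communicator creation with multiple unique IDs.
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  // Pick a local GPU (simplified local-rank mapping).
  int ndev = 0;
  cudaGetDeviceCount(&ndev);
  cudaSetDevice(rank % ndev);

  // Spread the bootstrap load: one unique ID for every 64 ranks (illustrative).
  int nIds = (nranks + 63) / 64;

  // The first nIds ranks each create one unique ID.
  ncclUniqueId myId;
  memset(&myId, 0, sizeof(myId));
  if (rank < nIds) ncclGetUniqueId(&myId);

  // Exchange IDs out-of-band. An allgather over all ranks keeps the sketch
  // short; only the slots from the first nIds ranks are actually used.
  ncclUniqueId* all = malloc(nranks * sizeof(ncclUniqueId));
  ncclUniqueId* ids = malloc(nIds   * sizeof(ncclUniqueId));
  MPI_Allgather(&myId, sizeof(ncclUniqueId), MPI_BYTE,
                all,   sizeof(ncclUniqueId), MPI_BYTE, MPI_COMM_WORLD);
  for (int i = 0; i < nIds; i++) ids[i] = all[i];

  // Create the communicator from all the unique IDs at once.
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  ncclComm_t comm;
  ncclCommInitRankScalable(&comm, nranks, rank, nIds, ids, &config);

  /* ... collectives ... */

  ncclCommDestroy(comm);
  free(ids); free(all);
  MPI_Finalize();
  return 0;
}
```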
Configuration Options
- Enabling In-Band Networking: Use NCCL_OOB_NET_ENABLE=1 to enable in-band networking.
- Specifying Network Interface: Use NCCL_OOB_NET_IFNAME to specify which interface should be used.
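These variables are normally exported in the job script or launcher environment; they can also be set from the application before the first NCCL call, as in the small sketch below. The interface name shown is only an example; use whatever device is appropriate on your system.

```c
#include <stdlib.h>

// Route the out-of-band bootstrap over the fast network and pin the interface.
// Must run before the first NCCL call so the values are picked up at init.
static void configure_oob_network(void) {
  setenv("NCCL_OOB_NET_ENABLE", "1", 1);       // enable in-band bootstrap
  setenv("NCCL_OOB_NET_IFNAME", "mlx5_0", 1);  // example interface name
}
```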
Intranode User Buffer Registration
Overview
NCCL 2.23 introduces intranode user buffer registration, allowing users to take advantage of registered user buffers for intranode operations. This feature enhances performance by reducing memory copies and improving data locality.
Benefits
- Reduced Memory Copies: Registered buffers minimize the need for memory copies, improving performance.
- Improved Data Locality: Data locality is enhanced, reducing the need for data movement and improving overall efficiency.
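As a rough sketch, user buffers can be allocated and registered with the existing ncclMemAlloc/ncclCommRegister calls; with 2.23, registered buffers can also benefit intranode operations. The collective, data type, and sizes below are placeholders, and error handling is omitted for brevity.

```c
// Hedged sketch: allocate, register, use, and deregister user buffers.
#include <cuda_runtime.h>
#include <nccl.h>

static ncclResult_t run_registered_allreduce(ncclComm_t comm, size_t count,
                                             cudaStream_t stream) {
  void *sendbuf, *recvbuf, *sendhandle, *recvhandle;

  // Allocate buffers through NCCL so they are eligible for registration.
  ncclMemAlloc(&sendbuf, count * sizeof(float));
  ncclMemAlloc(&recvbuf, count * sizeof(float));

  // Register the buffers with the communicator; NCCL can then use them
  // directly instead of staging data through its internal buffers.
  ncclCommRegister(comm, sendbuf, count * sizeof(float), &sendhandle);
  ncclCommRegister(comm, recvbuf, count * sizeof(float), &recvhandle);

  ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, ncclSum, comm, stream);

  // Deregister and free once the buffers are no longer needed.
  ncclCommDeregister(comm, sendhandle);
  ncclCommDeregister(comm, recvhandle);
  ncclMemFree(sendbuf);
  ncclMemFree(recvbuf);
  return ncclSuccess;
}
```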
New Profiler Plugin API
Overview
The new profiler plugin API provides hooks to measure fine-grained NCCL performance. This feature allows for detailed performance analysis and optimization.
Benefits
- Fine-Grained Performance Measurement: Users can measure performance at a fine-grained level, enabling detailed analysis and optimization.
- Customizable: The API is customizable, allowing users to tailor performance measurement to their specific needs.
Additional Updates
Bug Fixes and Minor Features
- Asynchronous Graph Allocation: Calls to cudaMalloc and cudaMemcpy during graph allocation are now asynchronous, significantly speeding up graph capture.
- Fatal IB Asynchronous Events: NCCL now uses fatal IB asynchronous events to stop network operations, helping catch link down errors and other fatal asynchronous events.
- Improved Initialization Logs: Initialization logs now report the actual NCCL function being performed, informing users whether NCCL is performing ncclCommInitRank or ncclCommSplit.
Conclusion
The NVIDIA Collective Communications Library (NCCL) version 2.23 brings significant enhancements to inter-GPU and multinode communication, crucial for AI and HPC applications. The new PAT algorithm, accelerated initialization, intranode user buffer registration, and new profiler plugin API are key features that improve scalability and performance. These updates are particularly beneficial for large-scale AI and HPC environments, where efficient parallel computing is essential. By leveraging these features, users can achieve better performance and scalability in their AI and HPC applications.