Summary

The NVIDIA Collective Communications Library (NCCL) has released version 2.23, which delivers significant improvements to inter-GPU and multinode communication, both crucial for efficient parallel computing in AI and high-performance computing (HPC) applications. Key enhancements include the new Parallel Aggregated Trees (PAT) algorithm for AllGather and ReduceScatter operations, accelerated initialization, intranode user buffer registration, and a new profiler plugin API. These features improve the scalability and performance of NCCL, particularly in large-scale AI and HPC environments.

New Scaling Algorithm: Parallel Aggregated Trees (PAT)

Overview of PAT

The PAT algorithm is a variation of the Bruck algorithm, designed to achieve logarithmic scaling for AllGather and ReduceScatter operations. This is particularly beneficial for small to medium message sizes, where the improvement increases as the workload scales. PAT executes a binomial tree shifted for each rank, offering an advantage over similar algorithms like recursive doubling by working on any number of ranks, not just powers of two.
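To see why a Bruck-style schedule reaches every rank in a logarithmic number of steps on any rank count (not just powers of two), consider the following simplified Python simulation. It models only the step count of a doubling-distance allgather; it is not NCCL's actual PAT implementation, which additionally shifts a binomial tree per rank and pipelines transfers.

```python
import math

def bruck_allgather_steps(n):
    """Simulate a Bruck-style allgather on n ranks.
    Each rank starts with its own chunk; at step k, rank r merges in
    the chunks currently held by rank (r + 2^k) mod n. Returns the
    number of communication steps until every rank holds all n chunks."""
    data = [{r} for r in range(n)]  # data[r] = set of chunks rank r holds
    steps, dist = 0, 1
    while any(len(d) < n for d in data):
        # Every rank exchanges with a peer at the current distance,
        # doubling the amount of data it holds each step.
        data = [data[r] | data[(r + dist) % n] for r in range(n)]
        dist *= 2
        steps += 1
    return steps

for n in (5, 8, 13, 16):
    print(n, bruck_allgather_steps(n), math.ceil(math.log2(n)))
```

For each rank count, the simulated step count matches ceil(log2(n)), including for non-power-of-two values such as 5 and 13, which is precisely where recursive doubling falls short.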

Key Features of PAT

  • Logarithmic Scaling: PAT features a logarithmic number of network steps for small sizes at scale, progressively increasing the number of network transfers as sizes increase to keep buffering needs minimal.
  • Flexibility: Unlike recursive doubling, PAT works on any number of ranks, making it versatile for various configurations.
  • Performance: Small to medium message sizes perform better with PAT, with improvements increasing as the workload scales.

Use Cases for PAT

PAT is particularly important for large language model (LLM) training, where pipeline parallelism and tensor parallelism run in dimensions orthogonal to data parallelism. The tensor parallelism dimension is usually aligned to the intranode NVLink connectivity, which means the other dimensions often have only one GPU per node and must communicate over the network.
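This layout can be made concrete with a toy rank grid. The node counts and parallelism sizes below are hypothetical (2 nodes of 8 GPUs, tensor parallelism of 8 aligned to NVLink); the point is that each data-parallel group ends up with exactly one GPU per node, so its collectives run over the inter-node network at small per-node rank counts, the regime PAT targets.

```python
# Hypothetical layout: 2 nodes x 8 GPUs, tensor parallelism (TP) of 8
# mapped onto the intranode NVLink domain.
NODES, GPUS_PER_NODE, TP = 2, 8, 8

ranks = range(NODES * GPUS_PER_NODE)
node_of = {r: r // GPUS_PER_NODE for r in ranks}

# Data-parallel groups: ranks sharing the same TP index.
dp_groups = [[r for r in ranks if r % TP == t] for t in range(TP)]

for g in dp_groups:
    # Each data-parallel group touches every node exactly once,
    # i.e. one GPU per node, so its collectives cross the network.
    assert len(set(node_of[r] for r in g)) == NODES

print(dp_groups[0])  # one rank on each node
```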

Accelerated Initialization

New Initialization API

The new ncclCommInitRankScalable API is an initialization function that lets a communicator be created from multiple unique IDs. This avoids the all-to-one communication pattern of a single-root bootstrap, making initialization performance scale to large rank counts.
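The effect of spreading the bootstrap over several unique IDs can be sketched with a toy load calculation (Python, illustrative only; the real entry point is the C function ncclCommInitRankScalable, and NCCL's actual rank-to-root assignment may differ from the round-robin rule assumed here):

```python
def bootstrap_load(nranks, nids):
    """Toy model: each rank contacts one root process during bootstrap.
    Ranks are assumed to spread round-robin over nids roots, so the
    busiest root handles roughly nranks / nids connections instead of
    all nranks when only a single unique ID is used."""
    per_root = [0] * nids
    for rank in range(nranks):
        per_root[rank % nids] += 1
    return max(per_root)

print(bootstrap_load(4096, 1))   # single ID: one root absorbs all 4096 ranks
print(bootstrap_load(4096, 64))  # 64 IDs: 64 connections per root
```

Holding the per-root load constant as the job grows is what yields roughly constant bootstrap time at scale.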

Key Features of Accelerated Initialization

  • Multiple Unique IDs: Users can provide more than one unique ID to be used during the bootstrap, spreading the load across multiple IDs for constant bootstrap time at scale.
  • In-Band Networking: The fast data network (IB/RoCE/…) can now carry the bootstrap's out-of-band traffic, speeding up the two linear steps of initialization: the bootstrap itself and the initial allgather.
  • Performance Tuning: Elimination of some bootstrap collectives and performance tuning in the bootstrap step improve overall initialization performance.

Configuration Options

  • Enabling In-Band Networking: Use NCCL_OOB_NET_ENABLE=1 to enable in-band networking.
  • Specifying Network Interface: Use NCCL_OOB_NET_IFNAME to specify which interface should be used.
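These variables must be set in the environment before NCCL is initialized, for example from a Python launcher before the first communicator is created. The interface name below is a placeholder; substitute the device name of your fabric.

```python
import os

# Set NCCL's out-of-band networking options before any NCCL
# initialization happens in this process.
os.environ["NCCL_OOB_NET_ENABLE"] = "1"       # route bootstrap over the fast network
os.environ["NCCL_OOB_NET_IFNAME"] = "mlx5_0"  # placeholder interface name; adjust to your system

print(os.environ["NCCL_OOB_NET_ENABLE"], os.environ["NCCL_OOB_NET_IFNAME"])
```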

Intranode User Buffer Registration

Overview

NCCL 2.23 introduces intranode user buffer registration, allowing NCCL to operate directly on registered user buffers for intranode operations. This improves performance by cutting out intermediate copies through NCCL's internal buffers and improving data locality.

Benefits

  • Reduced Memory Copies: Registered buffers minimize the need for memory copies, improving performance.
  • Improved Data Locality: Operating directly on user buffers keeps data in place, reducing unnecessary data movement and improving overall efficiency.

New Profiler Plugin API

Overview

The new profiler plugin API provides hooks to measure fine-grained NCCL performance, enabling detailed performance analysis and optimization.
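The general shape of such an event-hook interface can be sketched as follows. This is a generic start/stop-event pattern in Python for illustration only; NCCL's actual profiler plugin is a C interface with its own structures and callback names, which are not reproduced here.

```python
import time

class ToyProfiler:
    """Minimal start/stop-hook profiler in the spirit of an event-based
    plugin API (illustrative sketch, not NCCL's plugin interface)."""
    def __init__(self):
        self.events = []

    def start_event(self, name):
        # Record the event and hand back an opaque handle.
        self.events.append({"name": name,
                            "start": time.perf_counter(),
                            "stop": None})
        return len(self.events) - 1

    def stop_event(self, handle):
        self.events[handle]["stop"] = time.perf_counter()

prof = ToyProfiler()
h = prof.start_event("AllGather")
# ... the instrumented operation would run here ...
prof.stop_event(h)
ev = prof.events[h]
print(ev["name"], ev["stop"] - ev["start"] >= 0)
```

A runtime calls the start hook when an operation begins and the stop hook when it completes, so the plugin can attribute time to individual operations.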

Benefits

  • Fine-Grained Performance Measurement: Users can measure performance at a fine-grained level, enabling detailed analysis and optimization.
  • Customizable: The API is customizable, allowing users to tailor performance measurement to their specific needs.

Additional Updates

Bug Fixes and Minor Features

  • Asynchronous Graph Allocation: Calls to cudaMalloc and cudaMemcpy during graph allocation are now asynchronous, significantly speeding up graph capture.
  • Fatal IB Asynchronous Events: NCCL now uses fatal IB asynchronous events to stop network operations, helping catch link-down errors and other fatal asynchronous events.
  • Improved Initialization Logs: Initialization logs now report the actual NCCL function being performed, informing users if NCCL is performing ncclCommInitRank or ncclCommSplit.

Conclusion

The NVIDIA Collective Communications Library (NCCL) version 2.23 brings significant enhancements to inter-GPU and multinode communication, crucial for AI and HPC applications. The new PAT algorithm, accelerated initialization, intranode user buffer registration, and new profiler plugin API are key features that improve scalability and performance. These updates are particularly beneficial for large-scale AI and HPC environments, where efficient parallel computing is essential. By leveraging these features, users can achieve better performance and scalability in their AI and HPC applications.