Accelerating the HPCG Benchmark with NVIDIA Math Sparse Libraries

Summary

The NVIDIA HPCG benchmark program evaluates high-performance computing (HPC) systems using the sparse matrix computations that are typical of real-world applications. This article explores how NVIDIA’s high-performance math libraries, cuSPARSE and NVPL Sparse, are used to accelerate the HPCG benchmark on NVIDIA GPUs and Grace CPUs.

Accelerating the HPCG Benchmark with NVIDIA Math Sparse Libraries

The High Performance Conjugate Gradients (HPCG) benchmark is a widely used tool in high-performance computing (HPC). It tests HPC systems with the sparse matrix computations typical of real-world applications. Because these computations stress the memory subsystem and the internal interconnect of a supercomputer, HPCG gives a more comprehensive picture of computing performance than dense benchmarks such as LINPACK.

The Role of NVIDIA Math Sparse Libraries

NVIDIA’s high-performance math libraries, cuSPARSE and NVPL Sparse, play a crucial role in accelerating the HPCG benchmark on NVIDIA GPUs and Grace CPUs. These libraries are optimized for sparse linear algebra operations, which are essential for iterative algorithms like the preconditioned conjugate gradient (PCG) method used in the HPCG benchmark.
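To make the PCG structure concrete, here is a minimal pure-Python sketch. It is not the NVIDIA HPCG implementation: it assumes a single symmetric Gauss-Seidel sweep as the preconditioner (HPCG itself uses a multigrid preconditioner), a small 1D Laplacian as the test matrix, and hypothetical helper names (`spmv`, `symgs`, `pcg`, `laplacian_1d`). It only shows where the sparse matrix-vector product and the triangular sweeps appear in each iteration.

```python
import math

def spmv(rows, x):
    """y = A @ x, with A stored as a list of rows of (column, value) pairs."""
    return [sum(v * x[j] for j, v in row) for row in rows]

def symgs(rows, r):
    """One symmetric Gauss-Seidel sweep on A z = r, starting from z = 0."""
    n = len(rows)
    z = [0.0] * n
    forward = list(range(n))
    for order in (forward, forward[::-1]):  # forward sweep, then backward sweep
        for i in order:
            diag = 0.0
            acc = r[i]
            for j, v in rows[i]:
                if j == i:
                    diag = v
                else:
                    acc -= v * z[j]
            z[i] = acc / diag
    return z

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pcg(rows, b, tol=1e-10, max_iter=50):
    """Preconditioned conjugate gradient for a symmetric positive definite A."""
    x = [0.0] * len(b)
    b_norm = math.sqrt(dot(b, b))
    if b_norm == 0.0:
        return x, 0
    r = list(b)                      # residual for the zero initial guess
    z = symgs(rows, r)               # apply the preconditioner
    p = list(z)
    rz = dot(r, z)
    for it in range(1, max_iter + 1):
        Ap = spmv(rows, p)           # the SpMV step
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, Ap)]
        if math.sqrt(dot(r, r)) <= tol * b_norm:
            return x, it
        z = symgs(rows, r)           # the triangular-sweep step
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter

def laplacian_1d(n):
    """Tridiagonal (-1, 2, -1) matrix, used here only as a small SPD example."""
    rows = []
    for i in range(n):
        row = []
        if i > 0:
            row.append((i - 1, -1.0))
        row.append((i, 2.0))
        if i < n - 1:
            row.append((i + 1, -1.0))
        rows.append(row)
    return rows

A = laplacian_1d(8)
x_true = [float(i + 1) for i in range(8)]
x, iterations = pcg(A, spmv(A, x_true))
```

Each iteration performs one SpMV and one preconditioner application, which is why those two kernels dominate the benchmark's run time.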

cuSPARSE: This library is optimized for NVIDIA GPU architectures and supports a wide range of operations, including sparse matrix-vector multiplication (SpMV), sparse matrix-matrix multiplication (SpMM), and sparse triangular solves (SpSV). cuSPARSE offers configurable storage formats such as CSR and ELLPACK, flexible data types for input, output, and compute, and both 32-bit and 64-bit indexing. Its memory management features and extensive consistency checks help keep results reliable and performance consistent across diverse computing scenarios.
NVPL Sparse: This library is designed for aarch64 architectures such as the NVIDIA Grace CPU. It is integral to accelerating sparse linear algebra operations on these platforms, ensuring optimal performance for SpMV and SpSV operations.
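As a reference for what SpMV and SpSV compute, the following pure-Python sketch works directly on CSR arrays. It does not call cuSPARSE or NVPL Sparse; the function names and the 3x3 example matrix are assumptions made for illustration only.

```python
def csr_spmv(row_ptr, col_idx, vals, x):
    """Sparse matrix-vector product y = A @ x for A in CSR format."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y.append(acc)
    return y

def csr_lower_spsv(row_ptr, col_idx, vals, b):
    """Solve L x = b by forward substitution.

    L is lower triangular in CSR format with a nonzero diagonal.
    """
    x = [0.0] * len(b)
    for i in range(len(b)):
        acc = b[i]
        diag = None
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j == i:
                diag = vals[k]
            else:
                acc -= vals[k] * x[j]  # j < i, so x[j] is already known
        x[i] = acc / diag
    return x

# Example lower triangular matrix:
#   [2 0 0]
#   [1 3 0]
#   [0 4 5]
row_ptr = [0, 1, 3, 5]
col_idx = [0, 0, 1, 1, 2]
vals = [2.0, 1.0, 3.0, 4.0, 5.0]

b = csr_spmv(row_ptr, col_idx, vals, [1.0, 2.0, 3.0])
x = csr_lower_spsv(row_ptr, col_idx, vals, b)
```

Note the difference in available parallelism: every row of the SpMV is independent, while each row of the triangular solve depends on earlier rows. That dependency is what makes SpSV the harder kernel to accelerate.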

Enhancing Performance with NVIDIA Math Libraries

Integrating cuSPARSE and NVPL Sparse into the NVIDIA HPCG benchmark program significantly improves its performance on NVIDIA GPU and Grace CPU architectures. Two techniques contribute most to this improvement:

Sliced-ELLPACK Storage Format: The NVIDIA HPCG benchmark program stores sparse matrices in the sliced-ELLPACK format. Rows are grouped into slices, and each slice is padded only to the length of its own longest row, which reduces zero-padding in the lower and upper triangular matrices relative to plain ELLPACK. Adopting sliced-ELLPACK in place of traditional CSR has demonstrated notable performance improvements, including faster SpMV and SpSV operations on the NVIDIA DGX H100 platform.
Efficient Sparse Matrix Updates: The NVIDIA HPCG benchmark program avoids unnecessary memory accesses and redundant computation. For instance, it restricts matrix operations to the relevant portion of the matrix during iterative steps and uses specialized APIs such as cusparseSpSV_updateMatrix and nvpl_sparse_update_matrix to update sparse matrices in place.
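The padding saving of sliced-ELLPACK can be illustrated with a short pure-Python sketch. It assumes a hypothetical in-memory layout (one tuple per slice, entries stored column-major within the slice, column index -1 marking padding) and is not the layout used inside cuSPARSE or the benchmark. Plain ELLPACK is modelled here as a single slice covering all rows.

```python
def to_sliced_ell(rows, slice_size):
    """Convert rows of (column, value) pairs into a list of slices.

    Each slice is (n_rows, width, cols, vals); rows in a slice are padded
    to the longest row of that slice only.
    """
    slices = []
    for start in range(0, len(rows), slice_size):
        chunk = rows[start:start + slice_size]
        width = max(len(r) for r in chunk)
        cols, vals = [], []
        for k in range(width):          # column-major inside the slice
            for r in chunk:
                if k < len(r):
                    cols.append(r[k][0])
                    vals.append(r[k][1])
                else:
                    cols.append(-1)     # padding entry
                    vals.append(0.0)
        slices.append((len(chunk), width, cols, vals))
    return slices

def sliced_ell_spmv(slices, x):
    """y = A @ x for A stored in the sliced layout above."""
    y = []
    for n_rows, width, cols, vals in slices:
        acc = [0.0] * n_rows
        for k in range(width):
            for i in range(n_rows):
                idx = k * n_rows + i
                if cols[idx] >= 0:
                    acc[i] += vals[idx] * x[cols[idx]]
        y.extend(acc)
    return y

def stored_entries(slices):
    """Number of stored entries, including padding."""
    return sum(n_rows * width for n_rows, width, _, _ in slices)

# Dense lower triangle of a 4x4 matrix: row lengths 1, 2, 3, 4 (10 nonzeros).
lower = [[(j, float(i + j + 1)) for j in range(i + 1)] for i in range(4)]

plain_ell = to_sliced_ell(lower, slice_size=4)   # one slice = plain ELLPACK
sliced_ell = to_sliced_ell(lower, slice_size=2)

padded_plain = stored_entries(plain_ell)
padded_sliced = stored_entries(sliced_ell)
```

For this small triangular example the single-slice layout stores 16 entries for 10 nonzeros, while slices of two rows store 12. Triangular matrices benefit in particular because their row lengths grow steadily, so neighbouring rows have similar lengths.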

Overlapping Computation and Communication

The NVIDIA HPCG benchmark program overlaps computation with communication during critical operations such as SpMV and symmetric Gauss-Seidel (SYMGS) computations. The communication path transfers boundary data from the GPU to the CPU, performs MPI send/receive operations with neighboring processes, and transfers the results back to the GPU. Running these copies in CUDA streams separate from the computation kernels lets both proceed concurrently, which reduces idle time and improves overall throughput.
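The overlap pattern can be sketched without CUDA or MPI. In the pure-Python example below, a background thread stands in for the communication stream, `exchange_halo` is a hypothetical placeholder for the GPU-to-CPU copy, MPI exchange, and copy back, and the matrix is a 1D (-1, 2, -1) stencil chosen for brevity. The point is the ordering: start communication, compute rows that need no remote data, then wait and finish the boundary rows.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def exchange_halo(left_value, right_value):
    """Placeholder for the device-to-host copy, MPI send/receive, and copy back."""
    time.sleep(0.01)  # pretend communication latency
    return left_value, right_value

def stencil_row(x_left, x_mid, x_right):
    """One row of the (-1, 2, -1) stencil."""
    return -x_left + 2.0 * x_mid - x_right

def spmv_with_overlap(x_local, left_value, right_value):
    """Local SpMV for a rank owning x_local (at least two entries)."""
    n = len(x_local)
    y = [0.0] * n
    with ThreadPoolExecutor(max_workers=1) as pool:
        # 1. Start communication in the background.
        halo = pool.submit(exchange_halo, left_value, right_value)
        # 2. Interior rows only touch locally owned entries.
        for i in range(1, n - 1):
            y[i] = stencil_row(x_local[i - 1], x_local[i], x_local[i + 1])
        # 3. Wait for the halo values before touching boundary rows.
        left, right = halo.result()
    y[0] = stencil_row(left, x_local[0], x_local[1])
    y[n - 1] = stencil_row(x_local[n - 2], x_local[n - 1], right)
    return y

# A global vector x[i] = i*i split so that this rank owns entries 3..6.
x_global = [float(i * i) for i in range(10)]
y_local = spmv_with_overlap(x_global[3:7], x_global[2], x_global[7])
```

The interior loop hides the communication latency only if it takes at least as long as the exchange, which is why the technique pays off for large local problem sizes.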

Heterogeneous Computing Capabilities

The NVIDIA HPCG benchmark program extends its optimization strategies to heterogeneous computing environments that combine GPUs and NVIDIA Grace CPUs. It assigns one MPI rank to each GPU and one or more MPI ranks to the Grace CPU. To keep every part of the system busy, the GPU rank receives a larger local problem size than the Grace CPU ranks, in line with its higher throughput. With workloads balanced this way, all ranks reach blocking MPI communication steps such as MPI_Allreduce at roughly the same time, which minimizes idle time for any part of the system.
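The balancing argument is simple arithmetic, sketched below. The throughput figures and function names are hypothetical, chosen only to illustrate the idea; they are not measured values and not the benchmark's actual sizing logic. If a rank processes rows at a given rate, the time it spends before a blocking collective is its row count divided by that rate, and the faster rank waits for the slower one.

```python
def split_rows(total_rows, gpu_rate, cpu_rate):
    """Split rows in proportion to each rank's throughput (rows per second)."""
    gpu_rows = round(total_rows * gpu_rate / (gpu_rate + cpu_rate))
    return gpu_rows, total_rows - gpu_rows

def idle_time(gpu_rows, cpu_rows, gpu_rate, cpu_rate):
    """Time one rank spends waiting for the other at a blocking collective."""
    return abs(gpu_rows / gpu_rate - cpu_rows / cpu_rate)

# Hypothetical rates: the GPU rank is assumed to be 9x faster than the CPU rank.
GPU_RATE = 9.0
CPU_RATE = 1.0
TOTAL_ROWS = 1_000_000

balanced = split_rows(TOTAL_ROWS, GPU_RATE, CPU_RATE)
equal = (TOTAL_ROWS // 2, TOTAL_ROWS // 2)

idle_balanced = idle_time(*balanced, GPU_RATE, CPU_RATE)
idle_equal = idle_time(*equal, GPU_RATE, CPU_RATE)
```

Under these assumed rates the proportional split gives the GPU rank 900,000 rows and the CPU rank 100,000, so both finish together, whereas an equal split leaves the GPU rank waiting for most of each step.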

Performance Comparison

The performance comparison between the NVIDIA HPCG benchmark program and the official HPCG benchmark on the NVIDIA GH200-480GB highlights the benefits of NVIDIA’s optimizations. The NVIDIA HPCG CPU-only configuration demonstrated a significant performance improvement over the official HPCG benchmark due to CPU software optimization. The heterogeneous GPU and Grace CPU implementation showed a performance boost compared to the GPU-only setup when the Grace CPU handles a smaller problem that can overlap with the GPU workload.

Table: Key Features of NVIDIA HPCG Benchmark Program

Feature	Description
cuSPARSE	Optimized for NVIDIA GPU architectures, supports SpMV, SpMM, and SpSV operations.
NVPL Sparse	Designed for aarch64 architectures like NVIDIA Grace CPU, accelerates sparse linear algebra operations.
Sliced-ELLPACK	Minimizes computational overhead by reducing zero-padding in sparse matrices.
Overlap of Computation and Communication	Uses CUDA streams to reduce idle time and improve throughput.
Heterogeneous Computing	Seamlessly integrates GPUs and NVIDIA Grace CPUs, optimizing performance in heterogeneous environments.

Table: Performance Comparison

Configuration	Performance Improvement
NVIDIA HPCG CPU-only	Significant improvement over official HPCG benchmark due to CPU software optimization.
Heterogeneous GPU and Grace CPU	Performance boost compared to the GPU-only setup when the Grace CPU handles a smaller problem that overlaps with the GPU workload.

Conclusion

The NVIDIA HPCG benchmark program, built on NVIDIA’s high-performance math libraries cuSPARSE and NVPL Sparse, is a robust tool for evaluating the performance of HPC systems. These libraries provide optimized sparse matrix-vector multiplication and sparse triangular solves on NVIDIA GPUs and Grace CPUs. Together with the overlap of computation and communication and the support for heterogeneous GPU and Grace CPU configurations, they allow the benchmark to make fuller use of the hardware it measures. As HPC continues to evolve, tools like the NVIDIA HPCG benchmark program will remain important for assessing systems built for scientific computing.
