Unlocking Faster Deep Learning with NVIDIA’s cuBLAS 12.5: A Deep Dive into Grouped GEMM APIs
Summary
NVIDIA’s cuBLAS 12.5 introduces significant updates to enhance deep learning and high-performance computing workloads. The highlight of this release is the introduction of Grouped GEMM APIs, which generalize batched APIs by allowing different matrix sizes, transpositions, and scaling factors to be grouped and executed in one kernel launch. This article delves into the details of Grouped GEMM APIs, their performance benefits, and other key updates in cuBLAS 12.5.
Introduction to Grouped GEMM APIs
Grouped GEMM APIs are a new addition to the cuBLAS library, designed to improve the performance of deep learning and high-performance computing workloads. These APIs extend the functionality of batched APIs by enabling the grouping of different matrix sizes, transpositions, and scaling factors into a single kernel launch. This approach has shown significant speedups in certain scenarios, such as the generation phase of a mixture-of-experts (MoE) model.
Key Features of Grouped GEMM APIs
- Generalization of Batched APIs: Grouped GEMM APIs allow for the grouping of different matrix sizes, transpositions, and scaling factors, making them more versatile than traditional batched APIs.
- Variable Shapes and Transpositions: These APIs support variable shapes, transpositions, and scaling factors, providing greater flexibility in matrix operations.
- Performance Benefits: Grouped GEMM APIs have demonstrated a 1.2x speedup in certain scenarios, such as the generation phase of an MoE model with batch sizes of 8 and 64, using FP16 inputs and outputs.
New APIs for Grouped GEMM Support
Two new sets of APIs are available in the cuBLAS library for Grouped GEMM support:
- cublas&lt;t&gt;gemmGroupedBatched: Supports FP32 (including TF32) and FP64 precisions, where &lt;t&gt; is S or D for single and double precision, respectively.
- cublasGemmGroupedBatchedEx: Supports FP16, BF16, FP32 (including TF32), and FP64 precisions, offering a broader range of precision options.
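To make the grouped interface concrete, here is a minimal sketch that launches two groups of FP32 GEMMs with different shapes, transpositions, and scaling factors in a single call. The per-group parameter arrays and the flat, group-by-group pointer arrays follow the grouped-batched pattern, but treat the exact parameter order as an assumption and verify it against the cuBLAS 12.5 headers before use; the function and variable names are illustrative.

```cpp
#include <cublas_v2.h>

// Minimal sketch: two groups of FP32 GEMMs with different shapes,
// transpositions, and scaling factors, executed in one launch.
// dA/dB/dC are flat arrays of device pointers, one entry per problem,
// laid out group by group (2 + 3 = 5 problems here).
cublasStatus_t grouped_gemm_sketch(cublasHandle_t handle,
                                   const float* const dA[],
                                   const float* const dB[],
                                   float* const dC[]) {
  const int group_count = 2;
  const int group_size[] = {2, 3};  // problems per group

  // One entry per group: shapes, transposes, and scalars may all differ.
  const cublasOperation_t transa[] = {CUBLAS_OP_N, CUBLAS_OP_T};
  const cublasOperation_t transb[] = {CUBLAS_OP_N, CUBLAS_OP_N};
  const int m[]   = {128, 64};
  const int n[]   = {256, 64};
  const int k[]   = {64, 128};
  const int lda[] = {128, 128};
  const int ldb[] = {64, 128};
  const int ldc[] = {128, 64};
  const float alpha[] = {1.0f, 0.5f};  // per-group scaling factors
  const float beta[]  = {0.0f, 0.0f};

  // The S variant of cublas<t>gemmGroupedBatched; use the D variant for FP64.
  // Real code should check the returned status.
  return cublasSgemmGroupedBatched(handle, transa, transb, m, n, k,
                                   alpha, dA, lda, dB, ldb, beta, dC, ldc,
                                   group_count, group_size);
}
```

With classic batched APIs, each of these two groups would require its own kernel launch because the shapes differ; here both are dispatched together.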
Latest LLM Matmul Performance on NVIDIA GPUs
Recent performance snapshots show significant speedups for Llama 2 70B and GPT3 training phases on NVIDIA H100, H200, and L40S GPUs. The H200 GPU, in particular, demonstrates nearly 3x and 5x speedups compared to the A100 for Llama 2 70B and GPT3 training phases, respectively.
Performance Tuning and Benchmarking
The cuBLAS library uses a recommender system at runtime to dispatch the fastest available configuration for any user-requested matmul. This system is trained on actual timing data from a wide range of problems and configurations.
For advanced users, the cublasLtMatmulAlgoGetHeuristic API enables performance tuning to achieve faster implementations, as shown in the sketch below. Examples of auto-tuning in cuBLAS can be found in the NVIDIA/CUDALibrarySamples repository.
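The sketch below shows the basic shape of heuristic-driven algorithm selection with cublasLt. The descriptor setup is assumed to happen elsewhere, and the function and variable names are illustrative; only the cublasLt calls and attribute names come from the public API.

```cpp
#include <cublasLt.h>

// Sketch: query cuBLASLt's heuristics for candidate matmul algorithms.
// operationDesc and the layout descriptors are assumed to be configured
// elsewhere. An auto-tuner would time each returned candidate with
// cublasLtMatmul and keep the fastest.
cublasLtMatmulHeuristicResult_t pick_algo(cublasLtHandle_t ltHandle,
                                          cublasLtMatmulDesc_t operationDesc,
                                          cublasLtMatrixLayout_t Adesc,
                                          cublasLtMatrixLayout_t Bdesc,
                                          cublasLtMatrixLayout_t Cdesc,
                                          cublasLtMatrixLayout_t Ddesc) {
  cublasLtMatmulPreference_t preference;
  cublasLtMatmulPreferenceCreate(&preference);

  // Cap the workspace the heuristic may assume (32 MiB here).
  size_t workspaceSize = 32u * 1024 * 1024;
  cublasLtMatmulPreferenceSetAttribute(
      preference, CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES,
      &workspaceSize, sizeof(workspaceSize));

  // Ask for up to 8 candidates, ranked by estimated performance.
  cublasLtMatmulHeuristicResult_t results[8];
  int returned = 0;
  cublasLtMatmulAlgoGetHeuristic(ltHandle, operationDesc, Adesc, Bdesc,
                                 Cdesc, Ddesc, preference,
                                 8, results, &returned);

  cublasLtMatmulPreferenceDestroy(preference);
  // Real code should check the call's status and that returned > 0.
  return results[0];  // top-ranked candidate; time all `returned` to auto-tune
}
```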
Enhanced Functionality in cuBLASLt
Since cuBLAS 12.0, numerous enhancements have been introduced:
- Fused Epilogue Support: Parity between BF16 and FP16 precisions on NVIDIA Ampere and Ada GPUs; see the epilogue sketch after this list.
- Additional Fused Epilogues: New fused epilogues on NVIDIA Hopper and Ampere GPUs.
- FP8 Support: FP8 on Ada GPUs, with performance updates for Ada L4, L40, and L40S.
- Relaxed Size Limits: Removal of the M, N, and batch size limitations of the cuBLASLt matmul API.
- Improved Heuristics Cache: Better performance for workloads with high eviction rates.
- cuBLAS Symbols: Now available in the CUDA Toolkit symbols for Linux repository.
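As an illustration of what a fused epilogue buys you, the sketch below configures a cublasLt matmul descriptor to fuse a bias add and ReLU into the GEMM, so D = relu(alpha*A*B + beta*C + bias) runs as a single kernel instead of three. The descriptor and bias pointer are assumed to exist; the attribute and enum names are from the public cublasLt API.

```cpp
#include <cublasLt.h>

// Sketch: fuse a bias add and ReLU into a GEMM via a cublasLt epilogue,
// so D = relu(alpha*A*B + beta*C + bias) runs in one kernel launch.
// matmulDesc and d_bias (a device pointer) are assumed to exist.
void enable_relu_bias_epilogue(cublasLtMatmulDesc_t matmulDesc,
                               const void* d_bias) {
  cublasLtEpilogue_t epilogue = CUBLASLT_EPILOGUE_RELU_BIAS;
  cublasLtMatmulDescSetAttribute(matmulDesc, CUBLASLT_MATMUL_DESC_EPILOGUE,
                                 &epilogue, sizeof(epilogue));
  cublasLtMatmulDescSetAttribute(matmulDesc, CUBLASLT_MATMUL_DESC_BIAS_POINTER,
                                 &d_bias, sizeof(d_bias));
}
```

Fusing the epilogue avoids a round trip to global memory for the intermediate GEMM result, which is where most of the benefit comes from for memory-bound shapes.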
Table: Comparison of Grouped GEMM APIs
| API | Precisions | Variable Shapes | Transpositions | Scaling Factors |
|---|---|---|---|---|
| cublas&lt;t&gt;gemmGroupedBatched | FP32 (incl. TF32), FP64 | Yes | Yes | Yes |
| cublasGemmGroupedBatchedEx | FP16, BF16, FP32 (incl. TF32), FP64 | Yes | Yes | Yes |
Table: Performance Comparison on NVIDIA GPUs
| GPU | Llama 2 70B Training Speedup | GPT3 Training Speedup |
|---|---|---|
| H200 | 3x | 5x |
| H100 | 2x | 3x |
| L40S | 1.5x | 2x |
Conclusion
cuBLAS 12.5 delivers meaningful gains for deep learning and high-performance computing workloads. The new Grouped GEMM APIs, improved matrix multiplication performance on NVIDIA Hopper and Ada GPUs, and enhanced performance tuning options make this release a powerful tool for developers. By leveraging these updates, developers can achieve faster and more efficient deep learning and HPC workloads.