Unlocking Faster Deep Learning with NVIDIA’s cuBLAS 12.5: A Deep Dive into Grouped GEMM APIs
Summary
NVIDIA’s cuBLAS 12.5 introduces significant updates to enhance deep learning and high-performance computing workloads. The highlight of this release is the introduction of Grouped GEMM APIs, which generalize batched APIs by allowing different matrix sizes, transpositions, and scaling factors to be grouped and executed in one kernel launch. This article delves into the details of Grouped GEMM APIs, their performance benefits, and other key updates in cuBLAS 12.5.
Introduction to Grouped GEMM APIs
Grouped GEMM APIs are a new addition to the cuBLAS library, designed to improve the performance of deep learning and high-performance computing workloads. These APIs extend the functionality of batched APIs by enabling the grouping of different matrix sizes, transpositions, and scaling factors into a single kernel launch. This approach has shown significant speedups in certain scenarios, such as the generation phase of a mixture-of-experts (MoE) model.
Key Features of Grouped GEMM APIs
- Generalization of Batched APIs: Grouped GEMM APIs allow for the grouping of different matrix sizes, transpositions, and scaling factors, making them more versatile than traditional batched APIs.
- Variable Shapes and Transpositions: These APIs support variable shapes, transpositions, and scaling factors, providing greater flexibility in matrix operations.
- Performance Benefits: Grouped GEMM APIs have demonstrated a 1.2x speedup in certain scenarios, such as the generation phase of an MoE model with batch sizes of 8 and 64, using FP16 inputs and outputs.
New APIs for Grouped GEMM Support
Two new sets of APIs are available in the cuBLAS library for Grouped GEMM support:
- cublas&lt;t&gt;gemmGroupedBatched: Supports FP32 (including TF32) and FP64 precisions, where &lt;t&gt; is S or D for single and double precision, respectively.
- cublasGemmGroupedBatchedEx: Supports FP16, BF16, FP32 (including TF32), and FP64 precisions, offering a broader range of precision options.
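To make the grouped interface concrete, here is a minimal sketch that launches two groups of FP32 GEMMs with different shapes, transpositions, and scaling factors in a single call. The per-group parameter arrays and the flat, group-by-group pointer arrays follow the grouped-batched pattern, but treat the exact parameter order as an assumption and verify it against the cuBLAS 12.5 headers before use; the function and variable names are illustrative.

```cpp
#include <cublas_v2.h>

// Minimal sketch: two groups of FP32 GEMMs with different shapes,
// transpositions, and scaling factors, executed in one launch.
// dA/dB/dC are flat arrays of device pointers, one entry per problem,
// laid out group by group (2 + 3 = 5 problems here).
cublasStatus_t grouped_gemm_sketch(cublasHandle_t handle,
                                   const float* const dA[],
                                   const float* const dB[],
                                   float* const dC[]) {
  const int group_count = 2;
  const int group_size[] = {2, 3};  // problems per group

  // One entry per group: shapes, transposes, and scalars may all differ.
  const cublasOperation_t transa[] = {CUBLAS_OP_N, CUBLAS_OP_T};
  const cublasOperation_t transb[] = {CUBLAS_OP_N, CUBLAS_OP_N};
  const int m[]   = {128, 64};
  const int n[]   = {256, 64};
  const int k[]   = {64, 128};
  const int lda[] = {128, 128};
  const int ldb[] = {64, 128};
  const int ldc[] = {128, 64};
  const float alpha[] = {1.0f, 0.5f};  // per-group scaling factors
  const float beta[]  = {0.0f, 0.0f};

  // The S variant of cublas<t>gemmGroupedBatched; use the D variant for FP64.
  // Real code should check the returned status.
  return cublasSgemmGroupedBatched(handle, transa, transb, m, n, k,
                                   alpha, dA, lda, dB, ldb, beta, dC, ldc,
                                   group_count, group_size);
}
```

With classic batched APIs, each of these two groups would require its own kernel launch because the shapes differ; here both are dispatched together.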
Latest LLM Matmul Performance on NVIDIA GPUs
Recent performance snapshots show significant speedups for Llama 2 70B and GPT3 training phases on NVIDIA H100, H200, and L40S GPUs. The H200 GPU, in particular, demonstrates nearly 3x and 5x speedups compared to the A100 for Llama 2 70B and GPT3 training phases, respectively.
Performance Tuning and Benchmarking
The cuBLAS library uses a recommender system at runtime to dispatch the fastest available configuration for any user-requested matmul. This system is trained on actual timing data from a wide range of problems and configurations.
For advanced users, the cublasLtMatmulAlgoGetHeuristic API enables performance tuning to achieve faster implementations, as shown in the sketch below. Examples of auto-tuning in cuBLAS can be found in the NVIDIA/CUDALibrarySamples repository.
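The sketch below shows the basic shape of heuristic-driven algorithm selection with cublasLt. The descriptor setup is assumed to happen elsewhere, and the function and variable names are illustrative; only the cublasLt calls and attribute names come from the public API.

```cpp
#include <cublasLt.h>

// Sketch: query cuBLASLt's heuristics for candidate matmul algorithms.
// operationDesc and the layout descriptors are assumed to be configured
// elsewhere. An auto-tuner would time each returned candidate with
// cublasLtMatmul and keep the fastest.
cublasLtMatmulHeuristicResult_t pick_algo(cublasLtHandle_t ltHandle,
                                          cublasLtMatmulDesc_t operationDesc,
                                          cublasLtMatrixLayout_t Adesc,
                                          cublasLtMatrixLayout_t Bdesc,
                                          cublasLtMatrixLayout_t Cdesc,
                                          cublasLtMatrixLayout_t Ddesc) {
  cublasLtMatmulPreference_t preference;
  cublasLtMatmulPreferenceCreate(&preference);

  // Cap the workspace the heuristic may assume (32 MiB here).
  size_t workspaceSize = 32u * 1024 * 1024;
  cublasLtMatmulPreferenceSetAttribute(
      preference, CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES,
      &workspaceSize, sizeof(workspaceSize));

  // Ask for up to 8 candidates, ranked by estimated performance.
  cublasLtMatmulHeuristicResult_t results[8];
  int returned = 0;
  cublasLtMatmulAlgoGetHeuristic(ltHandle, operationDesc, Adesc, Bdesc,
                                 Cdesc, Ddesc, preference,
                                 8, results, &returned);

  cublasLtMatmulPreferenceDestroy(preference);
  // Real code should check the call's status and that returned > 0.
  return results[0];  // top-ranked candidate; time all `returned` to auto-tune
}
```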
Enhanced Functionality in cuBLASLt
Since cuBLAS 12.0, numerous enhancements have been introduced:
- Fused Epilogue Support: Parity between BF16 and FP16 precisions on NVIDIA Ampere and Ada GPUs; see the epilogue sketch after this list.
- Additional Fused Epilogues: New fused epilogues on NVIDIA Hopper and Ampere GPUs.
- FP8 Support: FP8 on Ada GPUs, with performance updates for Ada L4, L40, and L40S.
- Relaxed Size Limits: Removal of the M, N, and batch size limitations of the cuBLASLt matmul API.
- Improved Heuristics Cache: Better performance for workloads with high eviction rates.
- cuBLAS Symbols: Now available in the CUDA Toolkit symbols for Linux repository.
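As an illustration of what a fused epilogue buys you, the sketch below configures a cublasLt matmul descriptor to fuse a bias add and ReLU into the GEMM, so D = relu(alpha*A*B + beta*C + bias) runs as a single kernel instead of three. The descriptor and bias pointer are assumed to exist; the attribute and enum names are from the public cublasLt API.

```cpp
#include <cublasLt.h>

// Sketch: fuse a bias add and ReLU into a GEMM via a cublasLt epilogue,
// so D = relu(alpha*A*B + beta*C + bias) runs in one kernel launch.
// matmulDesc and d_bias (a device pointer) are assumed to exist.
void enable_relu_bias_epilogue(cublasLtMatmulDesc_t matmulDesc,
                               const void* d_bias) {
  cublasLtEpilogue_t epilogue = CUBLASLT_EPILOGUE_RELU_BIAS;
  cublasLtMatmulDescSetAttribute(matmulDesc, CUBLASLT_MATMUL_DESC_EPILOGUE,
                                 &epilogue, sizeof(epilogue));
  cublasLtMatmulDescSetAttribute(matmulDesc, CUBLASLT_MATMUL_DESC_BIAS_POINTER,
                                 &d_bias, sizeof(d_bias));
}
```

Fusing the epilogue avoids a round trip to global memory for the intermediate GEMM result, which is where most of the benefit comes from for memory-bound shapes.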
Table: Comparison of Grouped GEMM APIs
| API | Precisions | Variable Shapes | Transpositions | Scaling Factors |
|---|---|---|---|---|
| cublas&lt;t&gt;gemmGroupedBatched | FP32 (incl. TF32), FP64 | Yes | Yes | Yes |
| cublasGemmGroupedBatchedEx | FP16, BF16, FP32 (incl. TF32), FP64 | Yes | Yes | Yes |
Table: Performance Comparison on NVIDIA GPUs
| GPU | Llama 2 70B Training Speedup | GPT3 Training Speedup |
|---|---|---|
| H200 | 3x | 5x |
| H100 | 2x | 3x |
| L40S | 1.5x | 2x |
Conclusion
cuBLAS 12.5 delivers meaningful gains for deep learning and high-performance computing workloads. The new Grouped GEMM APIs, improved matrix multiplication performance on NVIDIA Hopper and Ada GPUs, and enhanced performance tuning options make this release a powerful tool for developers. By leveraging these updates, developers can achieve faster and more efficient deep learning and HPC workloads.