Summary
NVIDIA NeMo has made significant strides in accelerating automatic speech recognition (ASR) models, achieving speed improvements of up to 10x. This article covers the key enhancements behind these gains: autocasting tensors to bfloat16, the innovative label-looping algorithm, and CUDA Graphs, available with NeMo 2.0.0.
Overcoming Speed Performance Bottlenecks
NVIDIA NeMo ASR models faced several speed performance bottlenecks, including casting overheads, low compute intensity, and performance issues caused by divergence. To address these challenges, NVIDIA implemented several key enhancements.
Autocasting Tensors to bfloat16
One major bottleneck was the overhead of casting tensors from float32 to bfloat16. This casting added significant latency: as shown in Figure 2, casting before a matrix multiplication in the Parakeet CTC 1.1B model added roughly 200 microseconds of overhead, as long as the matrix multiplication itself. To resolve this, NVIDIA implemented full half-precision inference, performing operations entirely in bfloat16 or float16, which eliminates the casting overhead without compromising accuracy.
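To make the distinction concrete, here is a minimal PyTorch sketch (the model below is a small stand-in, not NeMo's actual encoder): torch.autocast keeps the weights in float32 and inserts casts around each eligible op, whereas converting the model and inputs to bfloat16 up front runs every op natively in half precision with no per-op casting.

```python
import torch

# Placeholder encoder standing in for an ASR model; shapes are illustrative.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 512),
).cuda()
features = torch.randn(16, 100, 512, device="cuda")

# Autocast: weights stay float32, and casts are inserted around each matmul.
# Those per-op casts are the overhead described above.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out_autocast = model(features)

# Full half-precision: convert weights and inputs once, then every op runs
# natively in bfloat16 with no casting inside the forward pass.
model = model.to(torch.bfloat16)
out_bf16 = model(features.to(torch.bfloat16))
```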
Label-Looping Algorithm
Another bottleneck was the sequential, utterance-by-utterance execution of operations such as CTC greedy decoding and feature normalization. By switching each to fully batched processing, NVIDIA gained roughly 10% throughput per operation, for an overall speedup of approximately 20%. The batched label-looping algorithm significantly increased efficiency for both RNN-T and TDT networks, enabling much faster decoding.
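The label-looping algorithm itself is too involved for a short snippet, but the simpler CTC case gives a feel for what "fully batched" means here. The sketch below is an illustrative implementation, not NeMo's code: it performs greedy CTC decoding for the whole batch with tensor operations (argmax over the vocabulary, collapse repeats, drop blanks) instead of a Python loop over utterances.

```python
import torch

def batched_ctc_greedy_decode(log_probs: torch.Tensor, blank_id: int) -> list[list[int]]:
    """Greedy CTC decoding for a whole batch at once.

    log_probs: (batch, time, vocab) per-frame log-probabilities.
    Returns label sequences with repeats collapsed and blanks removed.
    """
    # One batched argmax over the vocabulary instead of a per-utterance loop.
    best = log_probs.argmax(dim=-1)                      # (batch, time)
    # Mask frames that repeat the previous frame's label (CTC repeat collapse).
    changed = torch.ones_like(best, dtype=torch.bool)
    changed[:, 1:] = best[:, 1:] != best[:, :-1]
    keep = changed & (best != blank_id)
    # The final gather is per-utterance because output lengths differ.
    return [best[i][keep[i]].tolist() for i in range(best.size(0))]

# Example: batch of 2 utterances, 5 frames, vocab of 4 labels (3 = blank).
lp = torch.randn(2, 5, 4).log_softmax(dim=-1)
print(batched_ctc_greedy_decode(lp, blank_id=3))
```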
CUDA Graphs
The introduction of CUDA Graphs with NeMo 2.0.0 further accelerated ASR models. CUDA Graphs capture a whole sequence of GPU work, kernel launches and memory operations alike, into a single executable graph that can be replayed with one launch. This removes the overhead of launching each CUDA kernel individually, which dominates when many small kernels run per decoding step, and leads to significant performance improvements.
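Below is a sketch of the standard PyTorch CUDA Graphs pattern using torch.cuda.CUDAGraph; the tiny model is only a placeholder for the kind of decoding step NeMo wraps this way. The key idea: warm up, capture once into static buffers, then replay the whole kernel sequence with a single launch per step.

```python
import torch

# A stand-in for one decoding step; shapes and model are illustrative.
model = torch.nn.Linear(512, 512).cuda()

# Static input buffer: graph replay reuses the same memory addresses.
static_in = torch.randn(16, 512, device="cuda")

# Warm up on a side stream so capture sees steady-state allocations.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

# Capture the whole sequence of kernels once...
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out = model(static_in)

# ...then replay it with a single launch: copy new data into the static
# input buffer and call replay() instead of relaunching every kernel.
static_in.copy_(torch.randn(16, 512, device="cuda"))
graph.replay()
print(static_out.shape)
```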
Cost Comparison: CPU vs. GPU
To illustrate the benefits of ASR GPU-based inference, NVIDIA estimated the cost of transcribing 1 million hours of speech using both CPUs and NVIDIA GPUs. The comparison used the NVIDIA Parakeet RNN-T 1.1B model and showed that GPU-based inference is significantly more cost-effective.
CPU-Based Estimation
For the CPU estimation, NVIDIA ran NeMo ASR with a batch size of 1 on a single pinned CPU core. This method, a common industry practice, allows for linear scaling across multiple cores while maintaining a constant real-time factor (RTFx). The estimation used an AMD EPYC 9454 CPU with a measured RTFx of 4.5, available via Amazon EC2 C7a compute-optimized instances.
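As a quick sanity check on the arithmetic: RTFx expresses hours of audio processed per hour of wall-clock compute, so at an RTFx of 4.5, transcribing 1 million hours of speech requires roughly 1,000,000 / 4.5 ≈ 222,000 core-hours, no matter how many cores the work is spread across.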
GPU-Based Estimation
For the GPU estimation, NVIDIA used the results from the Hugging Face Open ASR Leaderboard, which were run on NVIDIA A100 80GB GPUs. The equivalent AWS instance is p4de.24xlarge, featuring 8x NVIDIA A100 80GB GPUs.
Cost Calculation
To calculate the total cost for both CPU and GPU, NVIDIA considered the following factors:
- CPU: The cost of running NeMo ASR on a single CPU core, scaled linearly across multiple cores.
- GPU: The cost of running NeMo ASR on NVIDIA A100 80GB GPUs, using the results from the Hugging Face Open ASR Leaderboard.
The comparison showed GPU-based inference to be dramatically cheaper, highlighting the benefits of using NVIDIA GPUs for large-scale ASR workloads; a back-of-the-envelope sketch follows.
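To make the arithmetic reproducible, here is a minimal cost sketch. The CPU RTFx of 4.5 and the 1-million-hour workload come from this article; every price below, and the GPU RTFx, are illustrative placeholders, not figures from the article or current AWS pricing.

```python
# Back-of-the-envelope transcription cost model. Only the CPU RTFx (4.5) and
# the 1M-hour workload come from the article; all other numbers are
# illustrative placeholder assumptions.
AUDIO_HOURS = 1_000_000

# CPU: AMD EPYC 9454, one pinned core per stream, measured RTFx of 4.5.
cpu_rtfx_per_core = 4.5
cpu_price_per_core_hour = 0.05       # assumption: illustrative C7a rate, USD
cpu_core_hours = AUDIO_HOURS / cpu_rtfx_per_core
cpu_cost = cpu_core_hours * cpu_price_per_core_hour

# GPU: p4de.24xlarge with 8x NVIDIA A100 80GB.
gpu_rtfx_per_gpu = 2000.0            # assumption: placeholder leaderboard RTFx
gpus_per_instance = 8
gpu_price_per_instance_hour = 40.0   # assumption: illustrative p4de rate, USD
gpu_instance_hours = AUDIO_HOURS / (gpu_rtfx_per_gpu * gpus_per_instance)
gpu_cost = gpu_instance_hours * gpu_price_per_instance_hour

print(f"CPU: {cpu_core_hours:,.0f} core-hours -> ~${cpu_cost:,.0f}")
print(f"GPU: {gpu_instance_hours:,.0f} instance-hours -> ~${gpu_cost:,.0f}")
```

Plugging in real on-demand prices and the leaderboard RTFx for Parakeet RNN-T 1.1B reproduces the article's conclusion that the GPU path is orders of magnitude cheaper per transcribed hour.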
Table: Key Enhancements and Their Impact
Enhancement | Impact
---|---
Autocasting tensors to bfloat16 | Eliminates unnecessary casting overhead, improving speed without compromising accuracy.
Label-looping algorithm | Increases efficiency for RNN-T and TDT networks, enabling faster decoding.
CUDA Graphs | Reduces the overhead of launching individual CUDA kernels, leading to significant performance improvements.
Table: Cost Comparison
Platform | Relative cost
---|---
CPU (AMD EPYC 9454) | Significantly higher: at RTFx 4.5 per core, transcribing 1 million hours requires hundreds of thousands of core-hours.
GPU (8x NVIDIA A100 80GB) | Significantly lower: far higher throughput per instance requires far fewer instance-hours, despite the higher hourly rate.
These advancements underscore the importance of optimizing ASR models for both speed and accuracy, making NVIDIA NeMo a leading platform for developing high-performance ASR solutions.
Conclusion
NVIDIA NeMo has achieved speed improvements of up to 10x in its ASR models through key enhancements such as autocasting tensors to bfloat16, the innovative label-looping algorithm, and the introduction of CUDA Graphs. These advancements not only accelerate ASR models but also make them more cost-effective, as demonstrated by the comparison between CPU- and GPU-based inference. With these improvements, NVIDIA NeMo continues to set the benchmark in the industry, particularly with models topping the Hugging Face Open ASR Leaderboard.