Unlocking High Performance on NVIDIA GPUs: Full-Stack Optimizations for Llama 3.2

Summary: NVIDIA has optimized the Llama 3.2 collection of models to deliver high throughput and low latency across millions of GPUs worldwide. This article explores the full-stack optimizations that enable high performance on NVIDIA GPUs, from data centers to local workstations and edge devices.

Delivering High Throughput and Low Latency

The Llama 3.2 collection includes small language models (SLMs) in 1B and 3B parameter variants and vision language models (VLMs) in 11B and 90B parameter variants; the VLMs support both text and image inputs. To deliver low-latency responses and high throughput, NVIDIA has optimized the Llama 3.2 models at every layer of the technology stack.

Optimizations for High-Performance Inference

NVIDIA has accelerated the Llama 3.2 model collection using the TensorRT and TensorRT-LLM libraries for high-performance deep learning inference. The Llama 3.2 1B and 3B models have been optimized for long-context support using the scaled rotary position embedding (RoPE) technique, alongside inference optimizations such as KV caching and in-flight batching.
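As an illustration of how these optimizations surface to developers, the sketch below serves a Llama 3.2 SLM through TensorRT-LLM's high-level LLM API. It is a minimal example, assuming the tensorrt_llm package is installed and the Hugging Face model ID shown is accessible; it is not NVIDIA's exact benchmarking setup.

```python
# Minimal sketch: serving Llama 3.2 1B with TensorRT-LLM's high-level LLM API.
# Assumes tensorrt_llm is installed and the Hugging Face model is available;
# the model ID and generation settings are illustrative placeholders.
from tensorrt_llm import LLM, SamplingParams

def main():
    # Building the engine enables KV caching and in-flight batching by default.
    llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")

    prompts = [
        "Summarize the benefit of KV caching in one sentence.",
        "What is in-flight batching?",
    ]
    sampling = SamplingParams(max_tokens=128, temperature=0.7)

    # Requests are batched in flight; each output arrives as its sequence finishes.
    for output in llm.generate(prompts, sampling):
        print(output.outputs[0].text)

if __name__ == "__main__":
    main()
```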

Multimodal Model Acceleration

The Llama 3.2 11B and 90B models are multimodal, combining a vision encoder with a text decoder. To accelerate these models, NVIDIA exports the model to an ONNX graph and builds a TensorRT engine from it. ONNX is an open standard for model definitions, with built-in operators and standard data types focused on inference; TensorRT consumes the ONNX graph and optimizes the model for the target GPU while building the engine.
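To make that flow concrete, here is a hedged sketch of the two steps for a vision encoder: export a PyTorch module to ONNX, then parse the graph and build a TensorRT engine. The toy encoder, input shape, and file names are placeholders; the actual Llama 3.2 export pipeline is more involved.

```python
# Sketch: export a vision encoder to ONNX, then build a TensorRT engine.
# The toy encoder, input shape, and file names are illustrative placeholders.
import torch
import tensorrt as trt

encoder = torch.nn.Sequential(  # stand-in for the real vision encoder
    torch.nn.Conv2d(3, 16, kernel_size=3), torch.nn.ReLU()
).eval()

# Step 1: ONNX export -- a standard model definition with built-in operators.
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(encoder, dummy, "vision_encoder.onnx", opset_version=17)

# Step 2: parse the ONNX graph and build an engine for the target GPU.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)
with open("vision_encoder.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # allow FP16 kernels where profitable
engine_bytes = builder.build_serialized_network(network, config)
with open("vision_encoder.engine", "wb") as f:
    f.write(engine_bytes)
```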

Throughput Performance on NVIDIA GPUs

Table 1 shows the maximum throughput performance of the Llama 3.2 90B model on eight NVIDIA H200 Tensor Core GPUs, each with 141 GB of fast HBM3e memory, connected through NVLink and NVLink Switch.

Input Sequence Length | Output Sequence Length | Maximum Throughput (tokens/second)
100                   | 100                    | 12,800
100                   | 2,000                  | 10,400
100                   | 4,000                  | 8,600

Throughput Performance on GeForce RTX 4090

For Windows deployments, NVIDIA has optimized the Llama 3.2 SLMs to run efficiently through the ONNX Runtime generate() API with a DirectML backend. Table 2 shows the maximum throughput performance of the Llama 3.2 3B model on the NVIDIA GeForce RTX 4090 GPU; a usage sketch follows the table.

Input Sequence Length | Output Sequence Length | Batch Size | Maximum Throughput (tokens/second)
100                   | 100                    | 1          | 253
100                   | 100                    | 4          | 615
100                   | 2,000                  | 1          | 203
100                   | 2,000                  | 4          | 374
100                   | 4,000                  | 1          | 165
100                   | 4,000                  | 4          | 251
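The sketch below drives a model through the generate() API with the onnxruntime-genai package. It is a minimal example assuming a DirectML build of onnxruntime-genai and a local folder containing an optimized Llama 3.2 3B model; the path and search options are placeholders.

```python
# Sketch: token generation with the ONNX Runtime generate() API (onnxruntime-genai).
# Assumes the DirectML package (onnxruntime-genai-directml) and a local model folder;
# the folder path and search options below are illustrative placeholders.
import onnxruntime_genai as og

model = og.Model("llama-3.2-3b-onnx")  # folder with the model and genai_config.json
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=2048, temperature=0.7)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("Explain DirectML in one paragraph."))

# Stream tokens until the model emits EOS or hits max_length.
stream = tokenizer.create_stream()
while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
```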

Deploying Accelerated Llama 3.2 Models

NVIDIA provides a range of deployment options for accelerated Llama 3.2 models, from data centers to local workstations and edge devices. The NVIDIA AI Enterprise software platform offers TensorRT-optimized inference on Llama 3.2 and other models from NVIDIA and its partner ecosystem.

Edge Deployment on NVIDIA Jetson

For edge deployment, NVIDIA provides Llama 3.2 SLMs quantized to INT4 and FP8, which can be deployed on NVIDIA Jetson devices by following the SLM Tutorial on NVIDIA Jetson AI Lab; a quantization sketch follows below.
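For illustration, low-bit weight quantization of this kind can be performed with NVIDIA's TensorRT Model Optimizer (modelopt). The sketch below is a hedged example assuming the nvidia-modelopt package, a CUDA-capable GPU, and a tiny stand-in calibration pass; it is not the exact recipe behind the Jetson builds.

```python
# Sketch: INT4 AWQ weight quantization with TensorRT Model Optimizer (modelopt).
# Assumes nvidia-modelopt and a CUDA GPU; the model ID and calibration data are
# placeholders, not the exact recipe used for the Jetson deployments.
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"  # illustrative
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id)

def calibrate(m):
    # Tiny stand-in calibration pass; real calibration uses a representative set.
    batch = tokenizer(
        "The quick brown fox jumps over the lazy dog.", return_tensors="pt"
    ).to(m.device)
    with torch.no_grad():
        m(**batch)

# INT4 AWQ config; mtq.FP8_DEFAULT_CFG would target FP8 instead.
model = mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop=calibrate)
```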

Conclusion

NVIDIA’s full-stack optimizations have unlocked high performance for the Llama 3.2 collection of models on NVIDIA GPUs. With accelerated inference and optimized deployment options, developers can deliver high-throughput, low-latency applications across millions of GPUs worldwide. Whether deploying in data centers, on local workstations, or on edge devices, NVIDIA’s optimized Llama 3.2 models provide a responsive end-user experience.