Summary: NVIDIA’s GH200 NVL32 system, powered by 32 NVIDIA GH200 Grace Hopper Superchips connected via the NVLink Switch system, significantly improves time-to-first-token (TTFT) performance for large language models (LLMs). This advancement is crucial for applications like interactive speech bots and coding assistants, where fast response times are essential. The GH200 NVL32 system demonstrates remarkable TTFT performance, even for long context lengths, making it ideal for real-time use cases.

Fast Response Times in Large Language Models: A Game-Changer

Large language models (LLMs) are revolutionizing various applications, from interactive speech bots to coding assistants. However, achieving fast response times in these models is crucial for delivering a seamless user experience. NVIDIA’s GH200 NVL32 system, featuring 32 NVIDIA GH200 Grace Hopper Superchips connected via the NVLink Switch system, is making significant strides in this area.

Time-to-First-Token (TTFT) Matters

TTFT is a critical metric for LLMs, as it measures the time taken to generate the first token in response to an input. Fast TTFT is essential for real-time use cases, where users expect immediate responses. The GH200 NVL32 system is designed to deliver exceptional TTFT performance, even for long context lengths.
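To make the metric concrete, here is a minimal sketch of how TTFT is typically measured against a streaming inference API. The `dummy_stream` generator is a hypothetical stand-in for a real model's token stream; the names here are illustrative, not part of any NVIDIA API.

```python
import time

def measure_ttft(stream):
    """Return the first token and the time-to-first-token in milliseconds.

    `stream` is any iterator that yields tokens as the model generates
    them (a hypothetical stand-in for a real streaming inference API).
    """
    start = time.perf_counter()
    first_token = next(stream)  # blocks until the first token arrives
    ttft = (time.perf_counter() - start) * 1000.0
    return first_token, ttft

def dummy_stream():
    """Simulated token stream: sleep stands in for prefill latency."""
    time.sleep(0.05)
    yield "Hello"
    yield ","

token, ttft = measure_ttft(dummy_stream())
```

In a real deployment the generator would wrap the server's streaming response, and TTFT would be dominated by prefill: processing the entire input sequence before the first output token can be produced.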

NVIDIA GH200 NVL32 Supercharges TTFT

The GH200 NVL32 system connects 32 NVIDIA GH200 Grace Hopper Superchips using the NVLink Switch system. This configuration enables each Hopper GPU to communicate with any other GPU within the NVLink domain at full 900 GB/s bandwidth, resulting in 28.8 TB/s of aggregate bandwidth. This massive bandwidth allows the system to process large input sequences quickly, making it ideal for long context LLMs.
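The aggregate figure follows directly from the per-GPU number, as this back-of-the-envelope check shows:

```python
# Each of the 32 Hopper GPUs in the NVLink domain gets the full
# 900 GB/s of NVLink bandwidth, so the aggregate is:
num_gpus = 32
per_gpu_gb_s = 900                                # GB/s per GPU
aggregate_tb_s = num_gpus * per_gpu_gb_s / 1000   # GB/s -> TB/s
print(aggregate_tb_s)  # 28.8 TB/s
```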

Llama 3.1 70B and 405B Performance

The GH200 NVL32 system demonstrates impressive TTFT performance for both Llama 3.1 70B and 405B models. For Llama 3.1 70B, the system achieves a TTFT of just 472 milliseconds for an input sequence length of 32,768 tokens. For Llama 3.1 405B, the system achieves a TTFT of approximately 1.6 seconds for an input sequence length of 32,768 tokens.

Llama 3.1 70B time to first token on GH200 NVL32:

Input sequence length (tokens)   TTFT (ms)
4,096                            64
32,768                           472
122,880                          2,197

Llama 3.1 405B time to first token on GH200 NVL32:

Input sequence length (tokens)   TTFT (ms)
4,096                            208
32,768                           1,627
122,880                          7,508
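As a sanity check, the figures above can be converted into approximate prefill throughput (input tokens processed per second). This is a back-of-the-envelope derivation from the published TTFT numbers, not an officially reported metric:

```python
# TTFT figures (ms) from the tables above, keyed by input sequence length.
ttft_ms = {
    "llama-3.1-70b":  {4096: 64,  32768: 472,  122880: 2197},
    "llama-3.1-405b": {4096: 208, 32768: 1627, 122880: 7508},
}

def prefill_tokens_per_second(isl, ttft):
    """Approximate prefill throughput: input tokens divided by TTFT."""
    return isl / (ttft / 1000.0)

for model, rows in ttft_ms.items():
    for isl, ttft in rows.items():
        tps = prefill_tokens_per_second(isl, ttft)
        print(f"{model} @ {isl:>7,} tokens: ~{tps:,.0f} tokens/s")
```

For the 70B model this works out to tens of thousands of input tokens processed per second, which is what makes sub-second TTFT possible even at 32K-token contexts.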

Accelerating Agentic Workflows

Agentic workflows, which involve tree search, self-reflection, and iterative inference, chain many sequential model calls, so each call's TTFT compounds into the end-to-end response time. Fast TTFT is therefore essential for keeping these workflows responsive, and the GH200 NVL32 system is well suited to them because it can process the large input sequences they generate quickly and efficiently.
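The compounding effect can be sketched with a simple model. The function below and its numbers are illustrative only; it uses the 472 ms TTFT reported above for Llama 3.1 70B at a 32,768-token input, and ignores decode time for simplicity:

```python
def agent_latency_ms(steps, ttft_ms, decode_ms_per_step=0.0):
    """Total time the user waits when an agent chains `steps` sequential
    model calls, each paying TTFT before any output appears."""
    return steps * (ttft_ms + decode_ms_per_step)

# Illustrative: 5 chained calls at 472 ms TTFT each.
print(agent_latency_ms(5, 472))  # 2360 ms of TTFT alone
```

Even a modest five-step agent accumulates over two seconds of TTFT alone, which is why per-call TTFT matters so much more in agentic pipelines than in single-shot chat.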

NVIDIA Blackwell GB200 NVL72: The Future of Computing

Looking ahead, NVIDIA’s Blackwell GB200 NVL72 system promises to deliver even more impressive performance. With a second-generation Transformer Engine and fifth-generation Tensor Cores, Blackwell delivers up to 20 PFLOPS of FP4 AI compute – a 5x increase over NVIDIA Hopper. The system also features fifth-generation NVLink, providing 1,800 GB/s of GPU-to-GPU bandwidth – twice that of Hopper.

Conclusion

NVIDIA’s GH200 NVL32 system is a significant step forward in achieving fast response times in large language models. With its impressive TTFT performance, even for long context lengths, this system is ideal for real-time use cases. As LLMs continue to grow in size and complexity, the need for fast and efficient inference will only increase. NVIDIA’s Blackwell GB200 NVL72 system promises to deliver even more impressive performance, making it an exciting development for the future of computing.