Summary
NVIDIA has announced TensorRT 8, a major update to its deep learning inference platform. The new version brings BERT-Large inference latency down to 1.2 milliseconds through new transformer optimizations. TensorRT 8 also achieves INT8 accuracy comparable to FP32 through Quantization Aware Training, and delivers significantly higher performance through support for the structured sparsity introduced with NVIDIA Ampere GPUs.
Accelerating BERT Inference with TensorRT 8
TensorRT is an SDK for high-performance deep learning inference that includes an inference optimizer and a runtime. It delivers low latency and high throughput, making it a crucial tool across industries such as healthcare, automotive, manufacturing, internet/telecom services, financial services, and energy.
Key Features of TensorRT 8
BERT Inference in 1.2 ms
TensorRT 8 brings BERT-Large inference latency down to 1.2 milliseconds with new transformer optimizations. This is a significant improvement that makes real-time natural language understanding more feasible.
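As a rough illustration of the deployment path, the sketch below builds a TensorRT engine from a BERT ONNX export using the TensorRT 8 Python API. The file names, input tensor names, and sequence length are assumptions for illustration; the record 1.2 ms figure relies on NVIDIA's optimized BERT demo and suitable hardware, so treat this as a minimal starting point rather than a reproduction of that result.

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)

# "bert_large.onnx" is a placeholder for your own BERT-Large export.
with open("bert_large.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # enable Tensor Core mixed precision
config.max_workspace_size = 1 << 30   # 1 GiB of scratch space for tactic selection

# If the model was exported with dynamic axes, TensorRT needs an
# optimization profile; the input names below are assumptions.
profile = builder.create_optimization_profile()
for name in ("input_ids", "attention_mask", "token_type_ids"):
    profile.set_shape(name, (1, 128), (1, 128), (1, 128))  # min/opt/max shapes
config.add_optimization_profile(profile)

# Build and serialize the engine for later deployment.
engine_bytes = builder.build_serialized_network(network, config)
with open("bert_large.engine", "wb") as f:
    f.write(engine_bytes)
```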
INT8 Precision with Quantization Aware Training
TensorRT 8 achieves accuracy equivalent to FP32 with INT8 precision by using Quantization Aware Training (QAT). Developers gain the speed and memory savings of lower precision without sacrificing model accuracy.
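A minimal QAT sketch, assuming NVIDIA's pytorch-quantization toolkit; the toy model, layer sizes, and file name are illustrative stand-ins, not from the announcement:

```python
import torch
from pytorch_quantization import quant_modules
from pytorch_quantization import nn as quant_nn

# Swap torch.nn layers (Linear, Conv, ...) for fake-quantized versions
# in any model constructed after this call.
quant_modules.initialize()

model = torch.nn.Sequential(          # toy stand-in for a real network
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)

# Calibrate: collect tensor ranges on representative data.
for module in model.modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        module.enable_calib()
        module.disable_quant()
with torch.no_grad():
    model(torch.randn(32, 128))       # replace with real calibration batches
for module in model.modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        module.load_calib_amax()
        module.disable_calib()
        module.enable_quant()

# ... fine-tune here with quantization in the loop (the "training" in QAT) ...

# Export with explicit Q/DQ nodes so TensorRT 8 can run true INT8 kernels.
quant_nn.TensorQuantizer.use_fb_fake_quant = True
torch.onnx.export(model, torch.randn(1, 128), "model_qat.onnx", opset_version=13)
```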
Sparsity Support
TensorRT 8 introduces Sparsity support for faster inference on Ampere GPUs. This feature exploits the fine-grained 2:4 structured sparsity pattern supported by Ampere's sparse Tensor Cores: weights pruned to that pattern let the hardware skip zero-valued computations, reducing inference time.
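In TensorRT 8's Python API, sparse kernels are opt-in through a builder flag. A minimal sketch, assuming the model's weights were already pruned to the 2:4 pattern (for example with NVIDIA's ASP tool):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Let the builder choose sparse Tensor Core kernels for any layer whose
# weights already follow the 2:4 structured-sparsity pattern.
config.set_flag(trt.BuilderFlag.SPARSE_WEIGHTS)
```

Layers whose weights do not match the pattern simply fall back to dense kernels, so enabling the flag is safe even for partially pruned models.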
Real-World Applications
WeChat’s Success Story
WeChat, one of the largest social media platforms in China, has successfully deployed TensorRT with INT8 QAT-based model inference acceleration. This reduced allocated computational resources by 70%, making it practical to fully integrate BERT/Transformer models into its solution.
How TensorRT 8 Works
TensorRT 8 includes several new features and optimizations that make it an ideal choice for accelerating deep learning inference:
- Generalized Optimizations: New transformer-based models are accelerated out of the box, roughly halving inference time compared to TensorRT 7.
- Quantization Aware Training: INT8 precision with accuracy equivalent to FP32, making it possible to deploy models with lower computational requirements without sacrificing accuracy (see the deployment sketch after this list).
- Sparsity: Support for the structured sparsity introduced with Ampere GPUs, which significantly improves inference performance by skipping computations on pruned, zero-valued weights.
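To tie the pieces together, the following sketch builds an INT8 engine from a QAT-exported ONNX model. The `model_qat.onnx` file name follows the QAT sketch above and is illustrative:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model_qat.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
# The Q/DQ nodes exported during QAT already carry the quantization
# scales, so no post-training calibrator is required here.
config.set_flag(trt.BuilderFlag.INT8)
config.set_flag(trt.BuilderFlag.SPARSE_WEIGHTS)  # optional, if weights are pruned

engine_bytes = builder.build_serialized_network(network, config)
with open("model_qat.engine", "wb") as f:
    f.write(engine_bytes)
```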
Getting Started with TensorRT 8
TensorRT 8 is freely available to members of the NVIDIA Developer Program. Here are some steps to get started:
- Visit the TensorRT Product Page: Learn more about TensorRT 8 and its features.
- Real-Time Natural Language Understanding with BERT Using TensorRT: Explore how to create a simple question answering application using Python, powered by TensorRT-optimized BERT code.
- Quantization Aware Training with TensorRT: Understand how to achieve FP32 accuracy for INT8 inference using Quantization Aware Training with TensorRT 8.
- Accelerating Inference with Sparsity: Learn how to use Sparsity with Ampere Architecture and TensorRT to accelerate inference.
- TensorRT Quick Start Guide: Follow this guide to quickly get started with TensorRT 8.
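Once an engine is built, running inference takes only a few lines. Below is a minimal sketch using the TensorRT 8 runtime with pycuda, assuming an engine with static input shapes; the binding-based calls shown were current in TensorRT 8, and the engine file name follows the sketches above.

```python
import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a CUDA context on import)
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("model_qat.engine", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Allocate one host/device buffer pair per binding.
bindings, buffers = [], []
for i in range(engine.num_bindings):
    shape = tuple(engine.get_binding_shape(i))
    dtype = trt.nptype(engine.get_binding_dtype(i))
    host = np.zeros(shape, dtype=dtype)
    device = cuda.mem_alloc(host.nbytes)
    bindings.append(int(device))
    buffers.append((host, device, engine.binding_is_input(i)))

# Fill the input host buffers with real data before this point.
for host, device, is_input in buffers:
    if is_input:
        cuda.memcpy_htod(device, host)
context.execute_v2(bindings)  # synchronous inference
for host, device, is_input in buffers:
    if not is_input:
        cuda.memcpy_dtoh(host, device)
```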
Conclusion
TensorRT 8 is a significant update that brings BERT-Large inference latency down to 1.2 milliseconds, making real-time natural language understanding far more practical. With its new transformer optimizations, Sparsity support, and Quantization Aware Training, TensorRT 8 is an essential tool for developers looking to accelerate deep learning inference. Whether you're working in healthcare, automotive, or financial services, TensorRT 8 can help you achieve faster inference without sacrificing accuracy.