Summary
NVIDIA has announced TensorRT 8, a major update to its deep learning inference platform. The new version brings BERT-Large inference latency down to 1.2 milliseconds through new transformer optimizations. TensorRT 8 also achieves INT8 accuracy comparable to FP32 through Quantization Aware Training, and delivers significantly higher performance through support for the structured sparsity introduced with NVIDIA Ampere GPUs.
Accelerating BERT Inference with TensorRT 8
TensorRT is an SDK for high-performance deep learning inference that includes an inference optimizer and a runtime. It delivers low latency and high throughput, making it a crucial tool across industries such as healthcare, automotive, manufacturing, internet/telecom services, financial services, and energy.
Key Features of TensorRT 8
BERT Inference in 1.2 ms
TensorRT 8 brings BERT-Large inference latency down to 1.2 milliseconds with new transformer optimizations. This is a significant improvement that makes real-time natural language understanding more feasible.
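As a rough illustration of the deployment path, the sketch below builds a TensorRT engine from a BERT ONNX export using the TensorRT 8 Python API. The file names, input tensor names, and sequence length are assumptions for illustration; the record 1.2 ms figure relies on NVIDIA's optimized BERT demo and suitable hardware, so treat this as a minimal starting point rather than a reproduction of that result.

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)

# "bert_large.onnx" is a placeholder for your own BERT-Large export.
with open("bert_large.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # enable Tensor Core mixed precision
config.max_workspace_size = 1 << 30   # 1 GiB of scratch space for tactic selection

# If the model was exported with dynamic axes, TensorRT needs an
# optimization profile; the input names below are assumptions.
profile = builder.create_optimization_profile()
for name in ("input_ids", "attention_mask", "token_type_ids"):
    profile.set_shape(name, (1, 128), (1, 128), (1, 128))  # min/opt/max shapes
config.add_optimization_profile(profile)

# Build and serialize the engine for later deployment.
engine_bytes = builder.build_serialized_network(network, config)
with open("bert_large.engine", "wb") as f:
    f.write(engine_bytes)
```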
INT8 Precision with Quantization Aware Training
TensorRT 8 achieves accuracy equivalent to FP32 with INT8 precision by using Quantization Aware Training (QAT). Developers gain the speed and memory savings of lower precision without sacrificing model accuracy.
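A minimal QAT sketch, assuming NVIDIA's pytorch-quantization toolkit; the toy model, layer sizes, and file name are illustrative stand-ins, not from the announcement:

```python
import torch
from pytorch_quantization import quant_modules
from pytorch_quantization import nn as quant_nn

# Swap torch.nn layers (Linear, Conv, ...) for fake-quantized versions
# in any model constructed after this call.
quant_modules.initialize()

model = torch.nn.Sequential(          # toy stand-in for a real network
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)

# Calibrate: collect tensor ranges on representative data.
for module in model.modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        module.enable_calib()
        module.disable_quant()
with torch.no_grad():
    model(torch.randn(32, 128))       # replace with real calibration batches
for module in model.modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        module.load_calib_amax()
        module.disable_calib()
        module.enable_quant()

# ... fine-tune here with quantization in the loop (the "training" in QAT) ...

# Export with explicit Q/DQ nodes so TensorRT 8 can run true INT8 kernels.
quant_nn.TensorQuantizer.use_fb_fake_quant = True
torch.onnx.export(model, torch.randn(1, 128), "model_qat.onnx", opset_version=13)
```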
Sparsity Support
TensorRT 8 introduces Sparsity support for faster inference on Ampere GPUs. This feature exploits the fine-grained 2:4 structured sparsity pattern supported by Ampere's sparse Tensor Cores: weights pruned to that pattern let the hardware skip zero-valued computations, reducing inference time.
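In TensorRT 8's Python API, sparse kernels are opt-in through a builder flag. A minimal sketch, assuming the model's weights were already pruned to the 2:4 pattern (for example with NVIDIA's ASP tool):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Let the builder choose sparse Tensor Core kernels for any layer whose
# weights already follow the 2:4 structured-sparsity pattern.
config.set_flag(trt.BuilderFlag.SPARSE_WEIGHTS)
```

Layers whose weights do not match the pattern simply fall back to dense kernels, so enabling the flag is safe even for partially pruned models.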
Real-World Applications
WeChat’s Success Story
WeChat, one of the largest social media platforms in China, has successfully deployed TensorRT with INT8 QAT-based model inference acceleration. This reduced allocated computational resources by 70%, making it practical to fully integrate BERT/Transformer models into its solution.
How TensorRT 8 Works
TensorRT 8 includes several new features and optimizations that make it an ideal choice for accelerating deep learning inference:
- Generalized Optimizations: New transformer-based models are accelerated out of the box, roughly halving inference time compared to TensorRT 7.
- Quantization Aware Training: INT8 precision with accuracy equivalent to FP32, making it possible to deploy models with lower computational requirements without sacrificing accuracy (see the deployment sketch after this list).
- Sparsity: Support for the structured sparsity introduced with Ampere GPUs, which significantly improves inference performance by skipping computations on pruned, zero-valued weights.
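To tie the pieces together, the following sketch builds an INT8 engine from a QAT-exported ONNX model. The `model_qat.onnx` file name follows the QAT sketch above and is illustrative:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model_qat.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
# The Q/DQ nodes exported during QAT already carry the quantization
# scales, so no post-training calibrator is required here.
config.set_flag(trt.BuilderFlag.INT8)
config.set_flag(trt.BuilderFlag.SPARSE_WEIGHTS)  # optional, if weights are pruned

engine_bytes = builder.build_serialized_network(network, config)
with open("model_qat.engine", "wb") as f:
    f.write(engine_bytes)
```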
Getting Started with TensorRT 8
TensorRT 8 is freely available to members of the NVIDIA Developer Program. Here are some steps to get started:
- Visit the TensorRT Product Page: Learn more about TensorRT 8 and its features.
- Real-Time Natural Language Understanding with BERT Using TensorRT: Explore how to create a simple question answering application using Python, powered by TensorRT-optimized BERT code.
- Quantization Aware Training with TensorRT: Understand how to achieve FP32 accuracy for INT8 inference using Quantization Aware Training with TensorRT 8.
- Accelerating Inference with Sparsity: Learn how to use Sparsity with Ampere Architecture and TensorRT to accelerate inference.
- TensorRT Quick Start Guide: Follow this guide to quickly get started with TensorRT 8.
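Once an engine is built, running inference takes only a few lines. Below is a minimal sketch using the TensorRT 8 runtime with pycuda, assuming an engine with static input shapes; the binding-based calls shown were current in TensorRT 8, and the engine file name follows the sketches above.

```python
import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a CUDA context on import)
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("model_qat.engine", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Allocate one host/device buffer pair per binding.
bindings, buffers = [], []
for i in range(engine.num_bindings):
    shape = tuple(engine.get_binding_shape(i))
    dtype = trt.nptype(engine.get_binding_dtype(i))
    host = np.zeros(shape, dtype=dtype)
    device = cuda.mem_alloc(host.nbytes)
    bindings.append(int(device))
    buffers.append((host, device, engine.binding_is_input(i)))

# Fill the input host buffers with real data before this point.
for host, device, is_input in buffers:
    if is_input:
        cuda.memcpy_htod(device, host)
context.execute_v2(bindings)  # synchronous inference
for host, device, is_input in buffers:
    if not is_input:
        cuda.memcpy_dtoh(host, device)
```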
Conclusion
TensorRT 8 is a significant update that brings BERT-Large inference latency down to 1.2 milliseconds, making real-time natural language understanding far more practical. With its new transformer optimizations, Sparsity support, and Quantization Aware Training, TensorRT 8 is an essential tool for developers looking to accelerate deep learning inference. Whether you're working in healthcare, automotive, or financial services, TensorRT 8 can help you achieve faster inference without sacrificing accuracy.