Training Autonomous Vehicles: How Tensor Parallelism Can Help
Summary: Autonomous vehicles rely on complex perception models whose training requires large amounts of GPU memory. To address this challenge, NVIDIA and NIO have collaborated on a research project that explores tensor parallel convolutional neural network (CNN) training. This approach can significantly reduce the per-GPU memory footprint, making it more feasible to train large vision models.
The Challenge of Training Autonomous Vehicles
Autonomous driving perception tasks involve extracting features from multicamera data using convolutional neural networks (CNNs) as the backbone. The forward activations of CNNs are feature maps of shape (N, C, H, W), where N, C, H, W are the number of images, number of channels, height, and width, respectively. These activations need to be saved for backward propagation, which consumes significant memory.
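To get a feel for the scale, here is a rough back-of-the-envelope calculation; the shape below is hypothetical and chosen only for illustration:

```python
# Rough, illustrative arithmetic only: a single FP16 feature map of shape
# (N, C, H, W) saved for backward propagation needs N * C * H * W * 2 bytes.
N, C, H, W = 7, 128, 512, 2048   # hypothetical multicamera activation shape
bytes_per_element = 2            # FP16
activation_bytes = N * C * H * W * bytes_per_element
print(f"One feature map: {activation_bytes / 1024**3:.2f} GiB")
# ~1.75 GiB for a single layer; a deep CNN backbone keeps many such maps alive.
```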
Tensor Parallel CNN Training
To address this challenge, NVIDIA and NIO have designed and implemented tensor parallel CNN training. This approach slices the inputs and intermediate activations across multiple GPUs, reducing the memory footprint and bandwidth pressure on each individual GPU. The model weights and optimizer states are replicated on every GPU, as in data parallel training.
Using PyTorch DTensor
PyTorch 2.0 introduces DTensor, which provides primitives to express tensor distribution, such as sharding and replication. This enables users to perform distributed computing without explicitly calling communication operators. The underlying implementation of DTensor encapsulates communication libraries such as NVIDIA Collective Communications Library (NCCL).
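As a minimal sketch of these primitives, assuming PyTorch 2.0 (where DTensor lives in the prototype module torch.distributed._tensor) and a multi-GPU launch via torchrun, a tensor can be replicated or sharded across a device mesh and operated on without any explicit communication calls:

```python
# A minimal sketch of DTensor's sharding/replication primitives, assuming
# PyTorch 2.0 (module path torch.distributed._tensor) and a launch such as:
#   torchrun --nproc_per_node=4 dtensor_basics.py
import torch
import torch.distributed as dist
from torch.distributed._tensor import DeviceMesh, Shard, Replicate, distribute_tensor

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())
mesh = DeviceMesh("cuda", list(range(dist.get_world_size())))

weight = torch.randn(256, 256, device="cuda")
x = torch.randn(8, 256, device="cuda")

# Replicate the weight on every GPU; shard the input along dim 0.
d_weight = distribute_tensor(weight, mesh, [Replicate()])
d_x = distribute_tensor(x, mesh, [Shard(0)])

# DTensor propagates the sharding and inserts any needed NCCL communication itself.
d_y = d_x @ d_weight
print(d_y.placements)  # e.g. (Shard(dim=0),)
```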
Implementation
Taking ConvNeXt-XL, a CNN commonly used as a backbone for vision tasks, as an example, we can demonstrate tensor parallel CNN training with DTensor. The model parameters are replicated on each GPU, while the model inputs are sliced across multiple GPUs.
| Model Parameters | Model Inputs |
| --- | --- |
| Replicate | Shard(3) |
| 350 million parameters, 1.4 GB of GPU memory (FP32) | Slices the W dimension of (N, C, H, W) |
For example, sharding the input of shape (7, 3, 512, 2048) on four GPUs generates four slices of shape (7, 3, 512, 512).
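A minimal sketch of this placement scheme, assuming four GPUs and the hypothetical input shape above, might look like the following:

```python
# A sketch under assumed conditions: 4 GPUs, launched with
#   torchrun --nproc_per_node=4 shard_input.py
import torch
import torch.distributed as dist
from torch.distributed._tensor import DeviceMesh, Shard, Replicate, distribute_tensor

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())
mesh = DeviceMesh("cuda", list(range(dist.get_world_size())))

# Multicamera input of shape (N, C, H, W) = (7, 3, 512, 2048).
images = torch.randn(7, 3, 512, 2048, device="cuda")

# Shard(3) slices the W dimension: each GPU holds a (7, 3, 512, 512) piece.
d_images = distribute_tensor(images, mesh, [Shard(3)])
print(d_images.to_local().shape)   # torch.Size([7, 3, 512, 512]) on every rank

# Model parameters stay replicated, as in data parallel training.
stem_weight = torch.randn(256, 3, 4, 4, device="cuda")  # hypothetical stem conv weight
d_stem_weight = distribute_tensor(stem_weight, mesh, [Replicate()])
print(d_stem_weight.placements)    # (Replicate(),)
```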
Benefits of Tensor Parallelism
Using DTensor to implement tensor parallel CNN training provides several benefits:
- Improved Training Efficiency: Tensor parallelism significantly reduces the per-GPU memory footprint while maintaining good scalability.
- Increased GPU Utilization: By distributing the workload across multiple GPUs, tensor parallelism can improve GPU utilization and reduce the cost of model training.
- Flexible Model Structure: Tensor parallelism enables a more flexible model structure, making it easier to train large vision models.
Benchmark Results
Benchmarks show that tensor parallelism performs well in NIO’s autonomous driving scenarios and effectively addresses the challenges of training large vision models. This approach has been tested on NIO’s Autonomous Driving Development Platform (NADP), which delivers high-performance computing and full-chain tools to process hundreds of thousands of daily inference and training tasks.
Table: Comparison of Training Methods
| Training Method | GPU Memory Footprint | Scalability |
| --- | --- | --- |
| Data Parallelism | High | Limited |
| Tensor Parallelism | Low | Good |
Note: The table above is a simplified comparison of training methods and is not intended to be a comprehensive analysis.
By adopting tensor parallel CNN training, developers can overcome the challenges of training large vision models and create more efficient and scalable perception models for autonomous vehicles.
Conclusion
Tensor parallel CNN training is a key approach to improving the efficiency of perception model training for autonomous vehicles. By leveraging the computing power and interconnects of multiple GPUs, this approach can make perception model training more widely accessible. With its ability to reduce the GPU memory footprint and improve scalability, tensor parallelism is an essential tool for training large vision models.