Summary
Federated learning is changing how autonomous vehicle (AV) models are developed, particularly in cross-country scenarios where diverse data sources and driving conditions matter. Unlike traditional machine learning methods that require centralized data storage, federated learning lets AVs collaboratively train algorithms on locally collected data while the data stays decentralized. This approach enhances privacy and security, because sensitive data never leaves the country, and it improves model robustness by incorporating a wide range of driving environments and situations.
Federated Learning in Autonomous Vehicles: A Game-Changer for Cross-Border Training
Motivation and Use Cases
The NVIDIA AV team operates globally, collecting data from diverse regions to advance AV initiatives. Training models for tasks such as object detection, parking, and sign detection therefore means handling data from multiple countries. Developing a separate AV model for each country would multiply approval processes, increasing costs and delays. Building a single global model that meets or exceeds the metrics of the individual country-specific models is more efficient.
AV Federated Learning Deployment Setup
The deployment consists of two federated learning clients and a central server. The clients run on different machine learning training systems, while the FL server is hosted on AWS in Japan. A development FL server instance in Hong Kong is maintained for testing and ongoing development.
AV Federated Learning Platform
The AV federated learning platform consists of several subsystems:
- Integration with Existing AV Machine Learning Training System (MAGLEV: NDAS)
  - The platform integrates with the local machine learning infrastructure (MAGLEV) and its training infrastructure, NDAS. MAGLEV itself is unaware of NVFLARE.
  - The NVFLARE third-party integration feature lets local training continue within the MAGLEV framework while model parameters are transferred to the NVIDIA FLARE client.
- Job Orchestration Service
  - A suite of front-end and back-end services is developed to streamline the creation and monitoring of federated learning jobs.
  - This system simplifies the user experience, enabling efficient initiation of jobs and seamless tracking of their progress.
- Federated Learning Engine with NVIDIA FLARE
  - The platform uses NVIDIA FLARE, an open-source federated learning framework, to train deep learning models on country-specific data without needing direct access to raw datasets.
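To make "training without direct access to raw datasets" concrete, here is a minimal sketch of one aggregation round. The source does not say which aggregation algorithm the platform uses, so this assumes plain weighted federated averaging (FedAvg). It uses only the Python standard library rather than the NVIDIA FLARE API, and the function name, parameter names, and client values are all hypothetical.

```python
from typing import Dict, List, Tuple

# A model is represented as a mapping from parameter name to a flat list of floats.
Weights = Dict[str, List[float]]


def federated_average(updates: List[Tuple[Weights, int]]) -> Weights:
    """Average client weights, weighting each client by its local sample count.

    Each update is a (weights, num_samples) pair. Only weights and counts
    reach the server; the raw training data stays with the client.
    """
    if not updates:
        raise ValueError("need at least one client update")
    total_samples = sum(num_samples for _, num_samples in updates)
    if total_samples <= 0:
        raise ValueError("total sample count must be positive")

    first_weights, _ = updates[0]
    averaged: Weights = {}
    for name, values in first_weights.items():
        accumulator = [0.0] * len(values)
        for weights, num_samples in updates:
            share = num_samples / total_samples
            for i, value in enumerate(weights[name]):
                accumulator[i] += value * share
        averaged[name] = accumulator
    return averaged


# Two hypothetical clients (for example, one per country) with different data volumes.
client_a = ({"layer.w": [1.0, 2.0], "layer.b": [0.0]}, 100)
client_b = ({"layer.w": [3.0, 4.0], "layer.b": [1.0]}, 300)

global_weights = federated_average([client_a, client_b])
print(global_weights)
```

Because client_b holds three times as many samples, the averaged parameters sit three quarters of the way toward its values. In a real deployment the server would send the averaged weights back to the clients for the next round.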
Challenges in Cross-Border Training
- IT Setup
  - Training data resides in an externally managed private cloud data center, requiring multiple approvals for configuration changes.
  - The FL server is hosted on the AWS public cloud, and the FL client in the private cloud communicates through a network path that connects to a community cloud before reaching the public AWS cloud.
- Network Bandwidth
  - Training can be slow due to multiple concurrent jobs and the large size of the models being transferred.
  - Two remedies are being explored: expanding the available bandwidth, and reducing bandwidth usage, which includes addressing model size inflation caused by unnecessary conversions.
- Network Outages
  - Training sessions can last for days or weeks, and periodic network outages can cause client training jobs to terminate for no apparent reason.
  - The NVFLARE team implemented a mechanism to recover from temporary network outages and resume training.
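The source does not describe how the NVFLARE recovery mechanism works. As a generic illustration of riding out a temporary outage rather than terminating the job, the sketch below retries a failed upload with exponential backoff while holding the finished update in memory. Every name here (`TransientNetworkError`, `send_with_retry`, `FlakyLink`) is hypothetical and is not part of NVIDIA FLARE.

```python
import time
from typing import Any, Callable, Dict, List


class TransientNetworkError(Exception):
    """Hypothetical error raised when the link to the FL server drops."""


def send_with_retry(
    send: Callable[[Dict[str, Any]], str],
    payload: Dict[str, Any],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try to send the payload, doubling the wait after each transient failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send(payload)
        except TransientNetworkError:
            if attempt == max_attempts:
                raise  # give up only after the final attempt
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("max_attempts must be at least 1")


class FlakyLink:
    """Simulated connection that fails a fixed number of times, then recovers."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    def send(self, payload: Dict[str, Any]) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise TransientNetworkError("link down")
        return f"ack:{payload['round']}"


# Record the requested delays instead of actually sleeping.
delays: List[float] = []
link = FlakyLink(failures=2)
ack = send_with_retry(
    link.send,
    {"round": 7, "weights": [0.1, 0.2]},
    sleep=delays.append,
)
print(ack, delays)
```

The point of the design is that the completed local update survives the outage, so a job that has run for days does not have to restart from scratch when the connection returns.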
Project Status
- Deployment and Production
  - The AV federated learning platform has been successfully implemented with the deployment of version 2.0, which has been in production for over a year.
  - A dozen AV models have been trained and released, most of which demonstrate equivalent or superior metrics compared to those trained locally.
- User Adoption
  - The number of data scientists using this platform has increased significantly, growing from just 2 individuals a year ago to approximately 30 today.
Example Models
- DoNET: A model that detects the status of vehicles, such as lamp status and door status.
- WaitNet: A model designed to detect static objects, including traffic lights, traffic signs, road markings, stop lines, and crosswalks.
- PathNet/RoadNet: A model that identifies the path the AV takes.
- RadarNet: A model that uses radar sensor data to predict obstacles around the vehicle.
- PredictionNet: A model for object tracking and trajectory prediction.
- EGM: A multi-camera input model for detecting barriers in parking lots, such as height limit poles and pillars.
Conclusion
Federated learning is a decentralized AI technology that enables model training without the need to move data, ensuring regulatory compliance and minimizing costs. The AV-NVIDIA FLARE system was built for autonomous vehicles, but its framework and solutions can also be applied effectively to other industries. By leveraging federated learning, AVs can adapt and optimize their performance across different terrains, climates, and traffic regulations, ensuring safer and more reliable autonomous driving experiences.