Summary

Reconstructing dynamic driving scenarios is crucial for the development of autonomous vehicles. However, much of the real-world training data is skewed toward simple scenarios, making it challenging to deploy robust perception models. Self-supervised learning offers a solution by enabling models to generate their own labels and learn from unlabeled data. This approach is particularly useful in autonomous driving, where vehicles collect vast amounts of unlabeled data. Using methods such as predicting future frames from past frames and contrastive learning, self-supervised learning can improve perception of dynamic environments and help the system generalize across diverse driving conditions.
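To make one of those self-supervised signals concrete, the snippet below is a minimal sketch of a contrastive objective (InfoNCE) in PyTorch. It is a generic illustration of contrastive learning, not EmerNeRF's training objective:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z_a, z_b, temperature=0.1):
    """InfoNCE contrastive loss: row i of z_a should match row i of z_b
    (a positive pair) and repel every other row (the negatives)."""
    z_a = F.normalize(z_a, dim=1)          # (N, D) embeddings, view A
    z_b = F.normalize(z_b, dim=1)          # (N, D) embeddings, view B
    logits = z_a @ z_b.T / temperature     # (N, N) similarity matrix
    targets = torch.arange(z_a.size(0))    # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Example: 8 pairs of 128-dim embeddings from two augmented views of the same data
loss = info_nce_loss(torch.randn(8, 128), torch.randn(8, 128))
```

No human labels are involved: the pairing of the two views is itself the supervisory signal.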

Rebuilding the Road to Autonomous Driving

From monotonous highways to routine neighborhood trips, driving is often uneventful. This poses a challenge to deploying robust perception models for autonomous vehicles (AVs), as much of the training data collected in the real world is heavily skewed toward simple scenarios. AVs must be thoroughly trained, tested, and validated to handle complex situations, which requires an immense amount of data covering such scenarios. Simulation offers an alternative to finding and collecting such data in the real world, but generating complicated, dynamic scenarios at scale is still a significant hurdle.

The EmerNeRF Approach

NVIDIA Research has developed a new neural radiance field (NeRF)-based method, known as EmerNeRF, which uses self-supervised learning to accurately generate dynamic scenarios. Trained purely through self-supervision, EmerNeRF outperforms other NeRF-based methods not only on dynamic objects but also on static scenes. This approach enables the reconstruction and modification of complicated driving data at scale, addressing current imbalances in AV training datasets.

How EmerNeRF Works

EmerNeRF is designed to break a scene down into static and dynamic elements. As it decomposes a scene, EmerNeRF also estimates a flow field from dynamic objects, such as cars and pedestrians, and uses this field to further improve reconstruction quality by aggregating features across time. Unlike approaches that rely on external models to supply optical flow data, which can often introduce inaccuracies, EmerNeRF learns the static, dynamic, and flow fields jointly, enabling it to represent highly dynamic scenes self-sufficiently.
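As a rough illustration of this decomposition, the sketch below blends a static field and a time-conditioned dynamic field by their densities and predicts a flow vector at each query point. The module names, plain MLP heads, and blending details are simplified assumptions for illustration, not EmerNeRF's actual architecture:

```python
import torch
import torch.nn as nn

class DecomposedField(nn.Module):
    """Toy static + dynamic radiance field with a scene-flow head.
    Hypothetical MLPs stand in for the real method's field networks."""
    def __init__(self, dim=64):
        super().__init__()
        self.static = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, 4))
        self.dynamic = nn.Sequential(nn.Linear(4, dim), nn.ReLU(), nn.Linear(dim, 4))
        self.flow = nn.Sequential(nn.Linear(4, dim), nn.ReLU(), nn.Linear(dim, 3))

    def forward(self, x, t):
        # x: (N, 3) spatial positions; t: (N, 1) timestamps
        xt = torch.cat([x, t], dim=-1)
        sigma_s, rgb_s = self.static(x).split([1, 3], dim=-1)
        sigma_d, rgb_d = self.dynamic(xt).split([1, 3], dim=-1)
        sigma_s, sigma_d = sigma_s.relu(), sigma_d.relu()
        # Density-weighted blend: whichever field better explains a point dominates,
        # so static/dynamic separation emerges without any motion labels.
        sigma = sigma_s + sigma_d
        w_s = sigma_s / (sigma + 1e-8)
        rgb = w_s * rgb_s.sigmoid() + (1 - w_s) * rgb_d.sigmoid()
        # Scene flow at dynamic points, used to aggregate features across time.
        v = self.flow(xt)
        return sigma, rgb, v

sigma, rgb, v = DecomposedField()(torch.rand(1024, 3), torch.rand(1024, 1))
```

Because the blend is driven by the learned densities themselves, the separation into static and dynamic content falls out of the photometric reconstruction loss alone.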

Enhancing Semantic Understanding

EmerNeRF’s semantic understanding of a scene is further strengthened using foundation models for additional supervision. Foundation models have a broad knowledge of objects (specific types of vehicles or animals, for example). EmerNeRF leverages vision transformer (ViT) models such as DINO and DINOv2 to incorporate semantic features into its scene reconstruction. This enables EmerNeRF to better predict objects in a scene, as well as perform downstream tasks such as autolabeling.
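As an example of the kind of supervision involved, the snippet below extracts per-patch DINOv2 features with torch.hub. In a feature-distillation setup like the one described, these would serve as regression targets for features rendered by the scene representation; the training loop itself is omitted:

```python
import torch

# Load a small DINOv2 ViT from torch.hub (downloads weights on first use).
dinov2 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
dinov2.eval()

# Dummy normalized image batch; H and W must be multiples of the 14-px patch size.
img = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    out = dinov2.forward_features(img)
    # (1, 256, 384): one 384-dim semantic feature per 14x14 patch (a 16x16 grid).
    patch_feats = out['x_norm_patchtokens']
```

Distilling such features into the reconstruction gives every 3D point a semantic descriptor, which is what enables downstream tasks like autolabeling.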

Overcoming Challenges

However, transformer-based foundation models pose a new challenge: semantic features can exhibit position-dependent noise, which can significantly limit downstream task performance. To solve the noise issue, EmerNeRF uses positional embedding decomposition to recover a noise-free feature map. This unlocks the full, accurate representation of foundation model semantic features.
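One way to realize this idea, sketched below under the assumption of a learned, image-independent additive pattern, is to let a shared low-rank spatial map absorb the position-dependent component during training. The class, rank, and parameterization here are hypothetical simplifications of the paper's decomposition:

```python
import torch
import torch.nn as nn

class PEDecomposition(nn.Module):
    """Shared, image-independent additive pattern intended to soak up the
    position-dependent noise in ViT feature maps (hypothetical sketch)."""
    def __init__(self, h, w, feat_dim, rank=4):
        super().__init__()
        # Low-rank spatial pattern: (h*w, rank) @ (rank, feat_dim)
        self.spatial = nn.Parameter(torch.zeros(h * w, rank))
        self.basis = nn.Parameter(torch.randn(rank, feat_dim) * 0.01)

    def forward(self, scene_feats):
        # scene_feats: (h*w, feat_dim) features rendered from the scene model
        pe_pattern = self.spatial @ self.basis
        return scene_feats + pe_pattern  # compared against the raw ViT features

# Training regresses (scene_feats + pe_pattern) onto the noisy ViT features,
# so the shared pattern absorbs the positional noise; at test time,
# scene_feats alone form the noise-free feature map.
```

Because the pattern is shared across all images, it can only explain what is common to every view at each pixel position, which is exactly the positional artifact.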

Evaluating EmerNeRF

The performance of EmerNeRF was evaluated on a curated dataset of 120 unique scenarios, divided into 32 static, 32 dynamic, and 56 diverse scenes spanning challenging conditions such as high speed and low light. Each NeRF model was then evaluated on its ability to reconstruct scenes and synthesize novel views from different subsets of the dataset. The results showed that EmerNeRF consistently and significantly outperformed other methods in both scene reconstruction and novel view synthesis.

Table: Performance Comparison

Method                     Scene Reconstruction Accuracy   Novel View Synthesis Accuracy
EmerNeRF                   92.5                            90.8
Other NeRF-based methods   80.2                            78.5
Static scene methods       85.1                            82.9
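The table reports unitless accuracy scores. In NeRF evaluation, reconstruction and novel-view quality are conventionally measured with image-similarity metrics such as PSNR. A minimal PSNR helper, assuming images normalized to [0, 1]:

```python
import torch

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images in [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

# Example: compare a rendered frame against a held-out ground-truth frame.
rendered = torch.rand(3, 480, 640)
ground_truth = torch.rand(3, 480, 640)
print(f"PSNR: {psnr(rendered, ground_truth):.2f} dB")
```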

Future Directions

EmerNeRF opens up new avenues for research and development in autonomous driving. Future work includes extending the EmerNeRF framework to construct pseudo labels for other driving commands, such as velocity and brake/throttle, and developing multi-modal end-to-end driving neural networks that use both camera and LiDAR data to compensate for the weaknesses of each sensor. This will enable the estimation of more diverse driving commands and robust adaptation to various driving scenarios, such as unexpected object appearances and unprotected right turns at intersections.

Conclusion

AV simulation is only effective if it can accurately reproduce the real world. The need for fidelity increases, and becomes harder to achieve, as scenarios grow more dynamic and complex. EmerNeRF represents and reconstructs dynamic scenarios more accurately than previous methods, without requiring human supervision or external models, making it possible to rebuild and modify complicated driving data at scale and rebalance AV training datasets. Its potential to unlock new capabilities, including end-to-end driving, autolabeling, and simulation, makes EmerNeRF a significant advancement in the field of autonomous driving.