Summary
The NVIDIA NeMo framework has introduced new capabilities to accelerate custom video foundation model pipelines: high-throughput data curation, efficient multimodal data loading, scalable model training, and parallelized in-framework inference. These features enable developers to pretrain and fine-tune video foundation models efficiently at scale.
Building Custom Video Foundation Models with NVIDIA NeMo
Introduction
Generative AI has evolved significantly, moving from text-based models to multimodal models, including video. This expansion into video opens up new potential uses across various industries such as robotics, autonomous vehicles, and entertainment. However, developing video foundation models presents unique challenges due to the vast and varied nature of video data. This underscores the necessity of scalable pipelines for curating data and training models that can comprehend temporal and spatial dynamics.
High-Throughput Video Curation
The NVIDIA NeMo framework includes NeMo Curator, which improves generative AI model accuracy by efficiently processing and preparing high-quality data, including large video datasets. NeMo Curator uses scalable data pipelines to clip, annotate, and filter videos efficiently. Key features include:
- Autobalancing Techniques: NeMo Curator can leverage heterogeneous clusters with multiple GPU types to optimize performance.
- Clipping Pipeline: Decodes and splits raw videos into short, continuous clips, which are then transcoded to a high-quality encoding (H.264) and annotated with video embeddings and captions.
- Sharding: Generates text embeddings for captions to create the final WebDataset used for training.
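To make the clipping step concrete, the sketch below splits a video's duration into contiguous fixed-length clips and builds WebDataset-style sample keys for them. The function names and the fixed-length splitting policy are illustrative assumptions, not NeMo Curator's actual API (which clips on scene boundaries and runs on GPUs).

```python
# Hypothetical sketch of the clip-splitting and sharding-key steps of a
# video curation pipeline. Names and policies are illustrative only.

def split_into_clips(duration_s: float, clip_len_s: float):
    """Split a video of duration_s seconds into contiguous clips of
    at most clip_len_s seconds, returned as (start, end) pairs."""
    clips = []
    start = 0.0
    while start < duration_s:
        end = min(start + clip_len_s, duration_s)
        clips.append((start, end))
        start = end
    return clips

def shard_key(video_id: str, clip_index: int) -> str:
    """Build a WebDataset-style sample key for one clip (illustrative)."""
    return f"{video_id}/clip-{clip_index:05d}"
```

Each clip would then be transcoded, embedded, and captioned before being written into the final WebDataset shards.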
Scalable Video Foundation Model Training
The NeMo framework supports various model parallelism techniques, including:
- Autoregressive Models: Reuses the well-established suite of NeMo tools for large language models (LLMs).
- Diffusion Models: Newly added support for diffusion transformers such as DiT, MovieGen, and NVIDIA Cosmos world foundation models.
- Model FLOPs Utilization: The NeMo tech stack is highly optimized, achieving more than 40% Model FLOPs Utilization (MFU) in the latest benchmark.
Efficient In-Framework Inference
The NeMo framework accelerates inference by distributing denoising operations across multiple GPUs through context parallelism. Key features include:
- Parallel Denoising: Each GPU denoises a portion of the latent sequence; the latent tensors are then gathered to reconstruct the full video sequence before decoding with the Cosmos video tokenizer.
- Performance Improvements: Benchmarks show 80–90% scaling efficiency on up to 32 H100 GPUs, with FP8 Multi-Head Attention providing significant performance improvements.
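The context-parallel pattern above can be simulated in a single process: the latent sequence is split into contiguous chunks, each "rank" denoises its chunk independently, and the chunks are gathered back in order. The `denoise` step here is a stand-in, and the chunking scheme is an illustrative assumption rather than NeMo's actual implementation.

```python
# Single-process simulation of context-parallel denoising.
# In the real pipeline each shard lives on a separate GPU and the
# gather is a collective operation; here we use plain lists.

def denoise(latent_chunk):
    # Stand-in for one denoising step over a chunk of latent values.
    return [0.5 * x for x in latent_chunk]

def context_parallel_denoise(latents, num_ranks):
    # Split the sequence into num_ranks contiguous chunks
    # (the last chunk may be shorter).
    chunk = (len(latents) + num_ranks - 1) // num_ranks
    shards = [latents[i * chunk:(i + 1) * chunk] for i in range(num_ranks)]
    # Each rank denoises its shard independently; the ordered gather
    # reconstructs the full latent sequence for decoding.
    denoised = [denoise(s) for s in shards]
    return [x for shard in denoised for x in shard]
```

Because the gather preserves order, the parallel result matches a serial pass over the whole sequence, which is what makes the near-linear scaling possible.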
Overview of Video Diffusion Pipeline
The video diffusion training pipeline includes:
- Tokenization: Uses a causal temporal 3D tokenizer to generate 3D spatio-temporal tokens.
- Transformer Decoder: Conditioned by the diffusion noise schedule timestep and text input, with additional Root Mean Square Layer Normalization (RMSNorm) to stabilize diffusion training.
- Diffusion Loss: Computes the diffusion loss using the parallelized EDM diffusion pipeline.
Table: GPU Utilization and Throughput Benchmark for NVIDIA NeMo Framework on Diffusion Transformers (DiT)
| Model Size | Context Length | Training Config | GPU Utilization (TFLOPS) | Throughput (tokens/s/GPU) |
|---|---|---|---|---|
| DiT 7B | 8k | Baseline, no optimization | OOM | - |
| DiT 7B | 8k | CP=2 | 457 | 8,969 |
| DiT 7B | 74k | TP=4 SP CP=4 | 414 | 2,933 |
| DiT 28B | 8k | TP=2 SP PP=2 | 435 | 2,392 |
| DiT 28B | 74k | TP=8 SP CP=4 PP=4 | 411 | 994 |
Legend:
- CP: Context parallelism
- TP: Tensor parallelism
- SP: Sequence parallelism
- PP: Pipeline parallelism
- OOM: Out of memory
Getting Started with NeMo
Developers can access the NeMo framework and start building custom video foundation models today. The NeMo Curator early access program allows for efficient video data curation, tokenization, pre-training, fine-tuning, and multi-GPU in-framework inference. For more information, visit the NVIDIA NeMo framework documentation and explore the NVIDIA Cosmos world foundation models.
Conclusion
The NVIDIA NeMo framework provides comprehensive capabilities for building custom video foundation models. With high-throughput data curation, scalable model training, and efficient in-framework inference, developers can pretrain and fine-tune video foundation models efficiently at scale. The NeMo framework is designed to help industries across robotics, autonomous vehicles, and entertainment realize the potential of video models.