Summary

The NVIDIA NeMo framework has introduced new capabilities to accelerate custom video foundation model pipelines. This includes high-throughput data curation, efficient multimodal data loading, scalable model training, and parallelized in-framework inference. These features enable developers to pretrain and fine-tune video foundation models effectively and efficiently.

Building Custom Video Foundation Models with NVIDIA NeMo

Introduction

Generative AI has evolved significantly, moving from text-based models to multimodal models, including video. This expansion into video opens up new potential uses across various industries such as robotics, autonomous vehicles, and entertainment. However, developing video foundation models presents unique challenges due to the vast and varied nature of video data. This underscores the necessity of scalable pipelines for curating data and training models that can comprehend temporal and spatial dynamics.

High-Throughput Video Curation

The NVIDIA NeMo framework includes NeMo Curator, which improves generative AI model accuracy by efficiently processing and preparing high-quality data, including large video datasets. NeMo Curator uses scalable data pipelines to clip, annotate, and filter videos efficiently. Key features include:

  • Autobalancing Techniques: NeMo Curator can leverage heterogeneous clusters with multiple GPU types to optimize performance.
  • Clipping Pipeline: Decodes and splits raw videos into short, continuous clips, which are then transcoded to high-quality H.264 encoding and annotated with video embeddings and captions.
  • Sharding: Generates text embeddings for captions to create the final WebDataset used for training.
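The clipping stage described above can be sketched in a few lines. This is an illustrative outline only, assuming scene-change timestamps are already available and using a hypothetical fixed maximum clip length; the function name and parameters are not NeMo Curator APIs.

```python
# Hypothetical sketch of the clipping stage: split a raw video into short,
# continuous clips, cutting at scene changes and capping clip length.
# split_into_clips and max_clip_s are illustrative assumptions, not NeMo APIs.

def split_into_clips(duration_s, scene_changes, max_clip_s=10.0):
    """Return (start, end) pairs covering the video, cut at scene changes
    and never exceeding max_clip_s seconds per clip."""
    boundaries = sorted(
        set([0.0, duration_s] + [t for t in scene_changes if 0.0 < t < duration_s])
    )
    clips = []
    for start, end in zip(boundaries, boundaries[1:]):
        # Subdivide long continuous segments into max_clip_s chunks.
        t = start
        while t < end:
            clips.append((t, min(t + max_clip_s, end)))
            t += max_clip_s
    return clips

# A 25 s video with one scene change at 12 s yields four clips.
clips = split_into_clips(25.0, [12.0])
```

In a real pipeline each resulting clip would then be transcoded, embedded, and captioned before sharding into the WebDataset.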

Scalable Video Foundation Model Training

The NeMo framework supports various model parallelism techniques, including:

  • Autoregressive Models: Reuse the well-established suite of NeMo tools built for large language models (LLMs).
  • Diffusion Models: Newly added support for diffusion transformers such as DiT, MovieGen, and NVIDIA Cosmos world foundation models.
  • Model FLOPs Utilization (MFU): The NeMo tech stack is highly optimized, achieving more than 40% MFU in the latest benchmarks.
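When combining the parallelism techniques listed in the legend below, the product of the model-parallel degrees must divide the total GPU count; the remainder becomes the data-parallel size. The following is a minimal sanity-check sketch of that arithmetic; the function name is illustrative, though NeMo/Megatron-style frameworks perform an equivalent validation when configuring a run.

```python
# Illustrative sketch: derive the implied data-parallel size from the
# tensor- (TP), pipeline- (PP), and context-parallel (CP) degrees.
# validate_parallelism is a hypothetical helper, not a NeMo API.

def validate_parallelism(world_size, tp=1, pp=1, cp=1):
    """Return the data-parallel size, or raise if the layout is invalid."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size {world_size} not divisible by TP*PP*CP = {model_parallel}"
        )
    return world_size // model_parallel

# E.g. a TP=8, CP=4, PP=4 layout (as in the DiT 28B / 74k row below)
# fully occupies 128 GPUs with data-parallel size 1.
dp = validate_parallelism(128, tp=8, pp=4, cp=4)
```

The 128-GPU figure here is chosen purely for illustration; the benchmark table does not state the cluster size.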

Efficient In-Framework Inference

The NeMo framework accelerates inference by distributing denoising operations across multiple GPUs through context parallelism. Key features include:

  • Parallel Denoising: Denoising runs in parallel across GPUs on shards of the latent sequence; the latent tensors are then gathered to reconstruct the full video sequence before decoding with the Cosmos video tokenizer.
  • Performance Improvements: Benchmarks show 80–90% scaling efficiency on up to 32 H100 GPUs, with FP8 Multi-Head Attention providing significant performance improvements.
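The context-parallel inference scheme above can be sketched as follows: shard the latent sequence along its temporal axis, denoise each shard independently (one shard per GPU), then gather the shards before decoding. This is a single-process simulation with a placeholder denoiser; `denoise_step` is a stand-in for a real EDM denoising step, not a NeMo API.

```python
import numpy as np

# Single-process sketch of context parallelism for diffusion inference.
# In the real pipeline each shard would live on a separate GPU and the
# final concatenation would be an all-gather collective.

def denoise_step(latent_shard, noise_scale=0.1):
    # Placeholder denoiser: shrink values toward zero. A stand-in only;
    # the real step evaluates the diffusion model on the shard.
    return latent_shard * (1.0 - noise_scale)

def context_parallel_denoise(latents, num_gpus):
    # Shard along the sequence (first) axis, one shard per simulated GPU.
    shards = np.array_split(latents, num_gpus, axis=0)
    denoised = [denoise_step(s) for s in shards]   # runs in parallel on GPUs
    return np.concatenate(denoised, axis=0)        # gather before decoding

latents = np.random.randn(64, 16)  # toy (sequence, channels) latent tensor
out = context_parallel_denoise(latents, num_gpus=4)
```

Because each denoising step is independent per sequence position (given cross-shard attention is handled by the context-parallel attention kernels), sharding the sequence is what yields the near-linear scaling reported above.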

Overview of Video Diffusion Pipeline

The video diffusion training pipeline includes:

  • Tokenization: Uses a causal temporal 3D tokenizer to generate 3D spatio-temporal tokens.
  • Transformer Decoder: Conditioned by the diffusion noise schedule timestep and text input, with additional Root Mean Square Layer Normalization (RMSNorm) to stabilize diffusion training.
  • Diffusion Loss: Computes the diffusion loss using the parallelized EDM diffusion pipeline.
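As a concrete reference for the normalization mentioned above, here is a minimal NumPy sketch of RMSNorm in its standard formulation (normalize by the root-mean-square over the hidden dimension, then scale by a learned gain). The variable names are illustrative; this is the textbook operation, not NeMo's implementation.

```python
import numpy as np

# Minimal RMSNorm sketch: x / RMS(x) * gain, with RMS computed over the
# last (hidden) dimension. Used in the transformer decoder to stabilize
# diffusion training.

def rms_norm(x, gain, eps=1e-6):
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain

hidden = np.array([[1.0, 2.0, 3.0, 4.0]])  # toy activations
gain = np.ones(4)                           # learned scale, initialized to 1
normed = rms_norm(hidden, gain)
```

Unlike LayerNorm, RMSNorm skips mean subtraction, which makes it cheaper while still bounding activation magnitudes.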

Table: GPU Utilization and Throughput Benchmark for NVIDIA NeMo Framework on Diffusion Transformers (DiT)

Model   | Context Length | Training Config           | GPU Utilization (TFLOPS/s) | Throughput (tokens/s/GPU)
DiT 7B  | 8k             | Baseline, no optimization | OOM                        | -
DiT 7B  | 8k             | CP=2                      | 457                        | 8,969
DiT 7B  | 74k            | TP=4 SP CP=4              | 414                        | 2,933
DiT 28B | 8k             | TP=2 SP PP=2              | 435                        | 2,392
DiT 28B | 74k            | TP=8 SP CP=4 PP=4         | 411                        | 994

Legend:

  • CP: Context parallelism
  • TP: Tensor parallelism
  • SP: Sequence parallelism
  • PP: Pipeline parallelism
  • OOM: Out of memory

Getting Started with NeMo

Developers can access the NeMo framework and start building custom video foundation models today. The NeMo Curator early access program allows for efficient video data curation, tokenization, pre-training, fine-tuning, and multi-GPU in-framework inference. For more information, visit the NVIDIA NeMo framework documentation and explore the NVIDIA Cosmos world foundation models.

Conclusion

The NVIDIA NeMo framework provides comprehensive capabilities for building custom video foundation models. With high-throughput data curation, scalable model training, and efficient in-framework inference, developers can effectively and efficiently pretrain and fine-tune video foundation models. The NeMo framework is designed to support various industries in leveraging the potential of video models.