Summary

Curating large-scale datasets is crucial for pretraining large language models (LLMs). NVIDIA NeMo Data Curator is a scalable data-curation tool that enables the creation of trillion-token multilingual datasets. This article explores the capabilities of NeMo Data Curator, its benefits, and how it can be used to prepare high-quality datasets for LLM pretraining.

Simplifying Trillion-Token Dataset Curation with NVIDIA NeMo Data Curator

The demand for large-scale datasets for pretraining LLMs is growing rapidly. To meet this need, NVIDIA has developed and released NeMo Data Curator, a scalable data-curation tool that enables the creation of trillion-token multilingual datasets. The tool is part of the NeMo framework and is designed to streamline and accelerate the data-curation process.

Key Features of NeMo Data Curator

NeMo Data Curator offers several key features that make it an ideal tool for curating large-scale datasets:

  • Scalability: NeMo Data Curator can scale to thousands of compute cores, enabling rapid processing of large datasets (see the sketch after this list).
  • Flexibility: The tool provides workflows to download and curate data from various public sources, such as Common Crawl, Wikipedia, and arXiv. It also allows developers to customize data curation pipelines to address unique requirements.
  • GPU Acceleration: GPU-accelerated deduplication is 20x faster and 5x cheaper than CPU-based methods.
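To make the scalability point concrete, the following is a minimal sketch of fanning a simple document filter out across many cores with Dask, a common way to parallelize this kind of text processing; it is not NeMo Data Curator's own API. The file paths, JSONL layout, and 100-word threshold are assumptions chosen only for illustration.

```python
# Illustrative only: scaling a simple document filter across many cores
# with Dask. This is not NeMo Data Curator's API; paths, record layout,
# and the 100-word threshold are placeholder assumptions.
import json

import dask.bag as db
from dask.distributed import Client

if __name__ == "__main__":
    client = Client()  # local cluster by default; point at a scheduler to scale out

    # Hypothetical sharded JSONL corpus with one {"text": ...} record per line.
    docs = db.read_text("corpus/shard-*.jsonl").map(json.loads)

    # Keep documents with at least 100 words (a stand-in heuristic filter).
    kept = docs.filter(lambda d: len(d["text"].split()) >= 100)

    # Write the surviving documents back out as sharded JSONL.
    kept.map(json.dumps).to_textfiles("curated/shard-*.jsonl")
    client.close()
```

The same script scales from a laptop to a large cluster by pointing `Client()` at a distributed scheduler instead of the default local one.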

Data Curation Pipeline

The data curation pipeline in NeMo Data Curator includes several stages, sketched in code after the list:

  • Data Download: Downloading data from public sources or private repositories.
  • Text Extraction: Extracting text from downloaded data.
  • Text Reformatting and Cleaning: Cleaning and reformatting text data to ensure consistency and quality.
  • Quality Filtering: Applying heuristic filters to remove low-quality data.
  • Deduplication: Removing duplicate records to ensure dataset uniqueness.
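The sketch below walks through the later stages in plain Python, assuming the download and text-extraction steps have already produced a JSONL file of raw documents. It illustrates the shape of the pipeline rather than NeMo Data Curator's actual implementation; the field names, cleaning rules, filter thresholds, and MD5-based exact deduplication are all assumptions for the example.

```python
# Illustrative pipeline pass (not NeMo Data Curator's API): clean, filter,
# and exact-deduplicate documents that were already downloaded and extracted.
import hashlib
import json
import re


def clean(text: str) -> str:
    """Cleaning/reformatting: strip control characters and normalize whitespace."""
    text = re.sub(r"[\x00-\x08\x0b-\x1f]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def passes_quality_filters(text: str) -> bool:
    """Heuristic quality filter: length and word-shape checks (example thresholds)."""
    words = text.split()
    if len(words) < 50:
        return False
    alpha_ratio = sum(w.isalpha() for w in words) / len(words)
    return alpha_ratio > 0.6


def curate(input_path: str, output_path: str) -> None:
    seen_hashes = set()  # exact deduplication by content hash
    with open(input_path) as src, open(output_path, "w") as dst:
        for line in src:
            doc = json.loads(line)           # one extracted document per line
            text = clean(doc["text"])
            if not passes_quality_filters(text):
                continue
            digest = hashlib.md5(text.encode("utf-8")).hexdigest()
            if digest in seen_hashes:
                continue
            seen_hashes.add(digest)
            dst.write(json.dumps({"text": text}) + "\n")


if __name__ == "__main__":
    curate("extracted.jsonl", "curated.jsonl")
```

In a production pipeline each stage would run in parallel across many shards, and near-duplicate (fuzzy) matching would complement the exact-hash check shown here.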

Case Study: Curating a 2T Token Dataset

NVIDIA used NeMo Data Curator to prepare a 2T-token dataset for pretraining a 43B-parameter large multilingual foundation model. The dataset spanned 53 natural languages and 37 programming languages and was curated from 8.7 TB of text data on a CPU cluster of more than 6,000 CPUs.

Benefits of NeMo Data Curator

NeMo Data Curator offers several benefits for LLM developers:

  • Improved Data Quality: The tool ensures high-quality data by applying rigorous cleaning and filtering processes.
  • Faster Processing: With GPU acceleration and scalability, NeMo Data Curator can process large datasets much faster than traditional methods.
  • Customization: Developers can customize data curation pipelines to meet specific requirements.

Table: Comparison of Deduplication Methods

| Method                        | Time     | Cost |
|-------------------------------|----------|------|
| CPU-based deduplication       | 37 hours | High |
| GPU-accelerated deduplication | 3 hours  | Low  |
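For a rough sense of how deduplication moves onto the GPU, here is a minimal cuDF (RAPIDS) sketch that drops exact duplicates on-device. It is not the deduplication implementation behind the timings above, and the single-shard file names are placeholders for illustration.

```python
# Illustrative GPU-side exact deduplication with cuDF (RAPIDS). This is not
# the implementation behind the table above; it only shows the general
# pattern of doing deduplication on the GPU. File names are placeholders.
import cudf

# Hypothetical JSONL shard with one {"id": ..., "text": ...} record per line.
df = cudf.read_json("shard-00.jsonl", lines=True)

# Drop exact duplicates by text content directly on the GPU.
deduped = df.drop_duplicates(subset=["text"])

print(f"kept {len(deduped)} of {len(df)} documents")
deduped.to_json("shard-00.dedup.jsonl", orient="records", lines=True)
```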

Table: Case Study Results

| Dataset Size | Languages                                      | Processing Time                      |
|--------------|------------------------------------------------|--------------------------------------|
| 2T tokens    | 53 natural languages, 37 programming languages | Several hours with NeMo Data Curator |

Conclusion

NVIDIA NeMo Data Curator is a powerful tool for curating large-scale datasets for LLM pretraining. Its scalability, flexibility, and GPU acceleration make it an ideal solution for organizations looking to improve the accuracy and efficiency of their LLMs. By using NeMo Data Curator, developers can create high-quality datasets that lead to better downstream performance and faster model convergence.