Summary:
Curating high-quality datasets is crucial for developing effective and fair large language models (LLMs). NVIDIA NeMo Curator is an open-source library designed to improve LLM training by providing scalable and efficient data curation. This article explores how to use NeMo Curator to curate non-English datasets for LLM training, focusing on the Thai Wikipedia dataset as an example.
Curating Non-English Datasets for LLM Training: A Guide
Large language models (LLMs) have revolutionized the field of natural language processing (NLP). However, their performance heavily depends on the quality of the training data. Curating high-quality datasets is essential for developing effective and fair LLMs. In this article, we will explore how to use NVIDIA NeMo Curator to curate non-English datasets for LLM training.
The Importance of Data Curation
Data curation plays a crucial role in the development of LLMs: the quality of the training data directly impacts model performance, and curation addresses issues such as bias, inconsistency, and redundancy. By curating high-quality datasets, we can help ensure that LLMs are accurate, reliable, and generalizable.
NVIDIA NeMo Curator: An Overview
NVIDIA NeMo Curator is an open-source library designed to improve LLM training by providing scalable and efficient data curation. It offers a customizable and modular interface that simplifies pipeline expansion and accelerates model convergence by preparing high-quality tokens. NeMo Curator enables you to mine high-quality text at scale from massive uncurated web corpora as well as custom datasets.
Curating Non-English Datasets with NeMo Curator
In this section, we will explore how to use NeMo Curator to curate non-English datasets for LLM training. We will use the Thai Wikipedia dataset as an example.
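Before any curation can happen, the raw data has to be obtained. As a hedged sketch, NeMo Curator ships a Wikipedia download helper; the output directory and dump date below are illustrative placeholders:
from nemo_curator.download import download_wikipedia
# Download and extract the Thai Wikipedia dump into JSONL documents
# (dump_date is a placeholder; pick an available 'YYYYMMDD' dump)
dataset = download_wikipedia('./thai_wikipedia_raw', language='th', dump_date='20240201')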
Language Separation
The downloaded Thai Wikipedia dataset may still contain documents in other languages. To retain only the Thai documents, we can perform language separation using NeMo Curator's predefined FastTextLangId filter, which uses a pretrained fastText language identification model to compute a language score and language label for each document, allowing us to filter out non-Thai documents.
from nemo_curator import ScoreFilter
from nemo_curator.filters import FastTextLangId
# Path to the pretrained fastText language ID model (e.g., lid.176.bin), downloaded separately
model_path = 'lid.176.bin'
# Define the language filter: keep documents whose language ID score is at least 0.5
language_filter = ScoreFilter(FastTextLangId(model_path, min_langid_score=0.5), score_field='language', score_type='object')
# Apply the language filter to the dataset
filtered_dataset = language_filter(dataset)
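After filtering, the score field holds a [score, label] pair for each document. A minimal sketch of extracting just the label and keeping the Thai documents, assuming a Dask-backed dataset and that fastText reports Thai as the two-letter code 'TH':
from nemo_curator.datasets import DocumentDataset
# Replace each [score, label] pair with just the language label
filtered_dataset.df['language'] = filtered_dataset.df['language'].apply(lambda score: score[1], meta=(None, str))
# Keep only the documents labeled as Thai
thai_dataset = DocumentDataset(filtered_dataset.df[filtered_dataset.df['language'] == 'TH'])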
Unicode Reformatter
Data scraped from the internet often contains various Unicode encodings and special characters that can lead to inconsistencies and errors in further processing. To standardize the text into a consistent format, we can use NeMo Curator's UnicodeReformatter, an implementation of the DocumentModifier interface, which defines how each document in the dataset should be modified.
from nemo_curator import Modify
from nemo_curator.modifiers import UnicodeReformatter
# Define the Unicode reformatter and wrap it in Modify, which applies it to every document
cleaner = Modify(UnicodeReformatter())
# Apply the Unicode reformatter to the Thai documents
cleaned_dataset = cleaner(thai_dataset)
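The same interface makes it straightforward to plug in custom cleaning logic. A minimal sketch, assuming you subclass DocumentModifier yourself (QuoteNormalizer is a hypothetical example, not part of NeMo Curator):
from nemo_curator import Modify
from nemo_curator.modifiers import DocumentModifier

class QuoteNormalizer(DocumentModifier):
    # Hypothetical example modifier: replace curly quotes with straight ASCII quotes
    def modify_document(self, text):
        return text.replace('\u201c', '"').replace('\u201d', '"')

normalized_dataset = Modify(QuoteNormalizer())(cleaned_dataset)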
Advanced Cleaning
Data quality is crucial for LLM training performance, so advanced curation techniques such as deduplication and heuristic filtering are often applied on top of the basic steps. NeMo Curator provides both exact deduplication, which hashes each document to find identical copies, and fuzzy deduplication, which uses MinHash and locality-sensitive hashing to find near-duplicates; both are typically run on a GPU (cuDF-backed) Dask cluster. The sketch below shows exact deduplication, with a fuzzy variant after it.
from nemo_curator import AddId, ExactDuplicates
from nemo_curator.datasets import DocumentDataset
# Deduplication needs a unique ID per document
id_dataset = AddId(id_field='id')(cleaned_dataset)
# Hash each document's text to find groups of identical copies
exact_duplicates = ExactDuplicates(id_field='id', text_field='text', hash_method='md5')
duplicates = exact_duplicates(id_dataset)
# Drop every copy but the first in each group of identical documents
docs_to_remove = duplicates.df.map_partitions(lambda x: x[x._hashes.duplicated(keep='first')])
deduplicated_dataset = DocumentDataset(id_dataset.df[~id_dataset.df['id'].isin(docs_to_remove['id'].compute())])
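Fuzzy deduplication follows the same pattern but is configured through FuzzyDuplicatesConfig and requires a GPU (cuDF-backed) Dask cluster. A hedged sketch, with the cache directory as an illustrative placeholder and all tuning parameters left at their defaults:
from nemo_curator import FuzzyDuplicates, FuzzyDuplicatesConfig
# Intermediate MinHash and LSH results are written to cache_dir
fuzzy_config = FuzzyDuplicatesConfig(cache_dir='./fuzzy_dedup_cache', id_field='id', text_field='text')
fuzzy_duplicates = FuzzyDuplicates(config=fuzzy_config)
# Returns groups of near-duplicate documents; drop all but one document per group
near_duplicates = fuzzy_duplicates(id_dataset)
fuzzy_docs_to_remove = near_duplicates.df.map_partitions(lambda x: x[x.group.duplicated(keep='first')])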
Putting it All Together
In this section, we will put together the data curation pipeline using NeMo Curator.
from nemo_curator import AddId, ExactDuplicates, Modify, ScoreFilter
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import FastTextLangId
from nemo_curator.modifiers import UnicodeReformatter
# Load the dataset
dataset = DocumentDataset.read_json('thai_wikipedia.jsonl', add_filename=True)
# Language separation: keep documents that the fastText model confidently identifies
model_path = 'lid.176.bin'  # pretrained fastText language ID model, downloaded separately
language_filter = ScoreFilter(FastTextLangId(model_path, min_langid_score=0.5), score_field='language', score_type='object')
filtered_dataset = language_filter(dataset)
filtered_dataset.df['language'] = filtered_dataset.df['language'].apply(lambda score: score[1], meta=(None, str))
thai_dataset = DocumentDataset(filtered_dataset.df[filtered_dataset.df['language'] == 'TH'])
# Unicode reformatting: standardize encodings and special characters
cleaned_dataset = Modify(UnicodeReformatter())(thai_dataset)
# Exact deduplication: hash each document and keep one copy per group of duplicates
id_dataset = AddId(id_field='id')(cleaned_dataset)
duplicates = ExactDuplicates(id_field='id', text_field='text', hash_method='md5')(id_dataset)
docs_to_remove = duplicates.df.map_partitions(lambda x: x[x._hashes.duplicated(keep='first')])
deduplicated_dataset = DocumentDataset(id_dataset.df[~id_dataset.df['id'].isin(docs_to_remove['id'].compute())])
# Save the curated dataset
deduplicated_dataset.to_json('curated_thai_wikipedia', write_to_filename=True)
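NeMo Curator runs these modules on Dask, so for large corpora you would typically start a Dask client before executing the pipeline. A minimal sketch, assuming NeMo Curator's distributed utilities (swap cluster_type to 'gpu' for the GPU-accelerated deduplication modules):
from nemo_curator.utils.distributed_utils import get_client
# Start a local Dask cluster before running the curation steps
client = get_client(cluster_type='cpu')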
Table: Comparison of Data Curation Techniques
| Technique | Description | Benefits |
| --- | --- | --- |
| Language Separation | Filters out documents not in the target language | Improves data quality and reduces bias |
| Unicode Reformatter | Standardizes text into a consistent format | Reduces errors and inconsistencies |
| Advanced Cleaning | Applies deduplication and heuristic filtering | Improves data quality and reduces redundancy |
By using NeMo Curator and applying these data curation techniques, we can create high-quality datasets for LLM training and improve the performance of our models.
Conclusion
In this article, we explored how to use NVIDIA NeMo Curator to curate non-English datasets for LLM training. Using the Thai Wikipedia dataset as an example, we demonstrated how to perform language separation, Unicode reformatting, and advanced cleaning with deduplication. By curating high-quality datasets, we can help ensure that LLMs are accurate, reliable, and generalizable.