Summary
Training localized multilingual large language models (LLMs) is crucial for AI systems to understand and communicate in diverse languages. NVIDIA NeMo provides a comprehensive platform for developing custom generative AI, including tools for training, retrieval-augmented generation, guardrailing, and data curation. This article explores the best practices for adding new language support to base LLMs using NeMo, focusing on training and merging a multilingual tokenizer and performing continual pretraining.
Building Localized Multilingual LLMs with NVIDIA NeMo
In today’s globalized world, the ability of AI systems to understand and communicate in diverse languages is increasingly important. Large language models (LLMs) have revolutionized the field of natural language processing, enabling AI to generate human-like text, answer questions, and perform various language tasks. However, most mainstream LLMs are trained on data corpora that primarily consist of English, limiting their applicability to other languages and cultural contexts.
Challenges in Multilingual LLMs
Current state-of-the-art LLMs often struggle with Southeast Asian (SEA) languages due to limited training data and the unique linguistic characteristics of these languages. This results in lower performance compared to high-resource languages like English. While some LLMs can handle certain SEA languages to an extent, they still exhibit inconsistencies, hallucinations, and safety issues.
NVIDIA NeMo: A Solution for Localized Multilingual LLMs
NVIDIA NeMo is an end-to-end platform for developing custom generative AI, anywhere. It includes tools for training, retrieval-augmented generation (RAG), guardrailing, and data curation, offering enterprises an easy, cost-effective, and fast way to adopt generative AI.
Workflow for Adding New Language Support
To construct a multilingual LLM, you have two main options:
- Use a multilingual dataset to pretrain the LLM from scratch.
- Adopt continual pretraining on English foundation models using a dataset of the target language.
This article focuses on the second option: continual pretraining. However, without a sufficiently expressive tokenizer, the model struggles to represent the low-resource language efficiently, leading to suboptimal performance. It is therefore necessary to build a customized tokenizer that enables the model to process and learn from the low-resource language data more effectively during continual pretraining.
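As a concrete starting point, a target-language tokenizer can be trained with SentencePiece before merging it with the base model tokenizer. The following is a minimal sketch, assuming the curated target-language text sits in a hypothetical file named `thai_wiki_cleaned.txt`; the vocabulary size and character coverage are illustrative values rather than recommendations.

```python
import sentencepiece as spm

# Train a BPE tokenizer on the curated target-language corpus.
# Paths and hyperparameters are illustrative; tune them for your data.
spm.SentencePieceTrainer.train(
    input="thai_wiki_cleaned.txt",     # one document (or sentence) per line
    model_prefix="thai_tokenizer",     # writes thai_tokenizer.model / .vocab
    vocab_size=8000,                   # size of the new target-language vocabulary
    model_type="bpe",
    character_coverage=0.9995,         # keep rare Thai characters in the vocabulary
)

# Quick sanity check: the new tokenizer should split Thai text into few tokens.
sp = spm.SentencePieceProcessor(model_file="thai_tokenizer.model")
print(sp.encode("สวัสดีครับ", out_type=str))
```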
Steps for Training Localized Multilingual LLMs
- Download and extract the GPT model to obtain the model weights and the model tokenizer.
- Train a tokenizer on the target-language data and merge it with the base model tokenizer to produce a bilingual tokenizer.
- Modify the GPT model architecture to accommodate the bilingual tokenizer (a minimal embedding-resize sketch follows this list).
- Perform continual pretraining using the target-language dataset.
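Conceptually, the third step comes down to extending the model's token embedding table (and any tied output layer) so that every entry of the merged vocabulary has a row. The sketch below illustrates the idea in plain PyTorch; it is not the actual NeMo checkpoint-modification code, and the vocabulary and hidden sizes are illustrative.

```python
import torch

def resize_token_embeddings(old_emb: torch.nn.Embedding, new_vocab_size: int) -> torch.nn.Embedding:
    """Copy pretrained rows into a larger embedding table; new rows are randomly initialized."""
    old_vocab_size, hidden_size = old_emb.weight.shape
    new_emb = torch.nn.Embedding(new_vocab_size, hidden_size)
    # Match the mean/std of the pretrained embeddings for the newly added rows.
    torch.nn.init.normal_(
        new_emb.weight,
        mean=old_emb.weight.mean().item(),
        std=old_emb.weight.std().item(),
    )
    with torch.no_grad():
        new_emb.weight[:old_vocab_size] = old_emb.weight
    return new_emb

# Illustrative dimensions: extend a 50,257-entry embedding table to cover 8,000 added tokens.
old = torch.nn.Embedding(50257, 2048)
new = resize_token_embeddings(old, 50257 + 8000)
```

The newly added rows start from a random initialization matched to the pretrained embedding statistics, so continual pretraining can learn them without disturbing the weights of the existing vocabulary.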
Data Collection and Cleaning
For this tutorial, use the NVIDIA NeMo Curator repository on GitHub to download and curate high-quality target language data. NVIDIA NeMo Curator consists of a collection of scalable data-mining modules for curating NLP data for training LLMs. The modules within NeMo Curator enable NLP researchers to mine high-quality text at scale from massive uncurated web corpora.
Curation Pipeline
- Separate languages to filter out non-target-language content.
- Reformat documents to rectify Unicode encoding errors.
- Perform document-level exact deduplication and fuzzy deduplication to remove duplicated data points (a minimal deduplication sketch follows this list).
- Perform document-level heuristic filtering to remove low-quality documents.
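As an illustration, exact deduplication reduces to hashing each document and keeping one representative per hash. The following standalone sketch shows that idea; it is not the NeMo Curator API, and the function name and sample documents are hypothetical.

```python
import hashlib

def exact_dedup(documents):
    """Keep the first occurrence of each document, keyed by an MD5 hash of its text."""
    seen = set()
    unique_docs = []
    for doc in documents:
        digest = hashlib.md5(doc.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique_docs.append(doc)
    return unique_docs

docs = ["สวัสดี โลก", "Hello world", "สวัสดี โลก"]  # the duplicate Thai document is dropped
print(len(exact_dedup(docs)))  # 2
```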
Example: Training a Localized Multilingual LLM with Thai Wikipedia Data
In this example, we use Thai Wikipedia data to continually pretrain a GPT-1.3B model. We focus on training and merging a multilingual tokenizer and then discuss adopting the customized tokenizer in NeMo models and performing continual pretraining.
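Assuming the base model also ships a SentencePiece tokenizer, one common way to merge the two vocabularies is to append the new Thai pieces to the base tokenizer's model protobuf, as sketched below. The file names (`base_tokenizer.model`, `thai_tokenizer.model`, `bilingual_tokenizer.model`) are hypothetical, and the neutral score assigned to the added pieces is an illustrative choice; if the base model instead uses a GPT-2 style BPE tokenizer, its vocabulary and merge files would need to be extended in an analogous way.

```python
import sentencepiece as spm
# Requires the protobuf package for the generated model definitions.
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# Load the base (English) tokenizer and the newly trained Thai tokenizer.
base_sp = spm.SentencePieceProcessor(model_file="base_tokenizer.model")
thai_sp = spm.SentencePieceProcessor(model_file="thai_tokenizer.model")

base_proto = sp_pb2.ModelProto()
base_proto.ParseFromString(base_sp.serialized_model_proto())
thai_proto = sp_pb2.ModelProto()
thai_proto.ParseFromString(thai_sp.serialized_model_proto())

# Append Thai pieces that the base tokenizer does not already contain.
existing = {p.piece for p in base_proto.pieces}
for p in thai_proto.pieces:
    if p.piece not in existing:
        new_piece = sp_pb2.ModelProto.SentencePiece()
        new_piece.piece = p.piece
        new_piece.score = 0.0  # neutral score for the added pieces
        base_proto.pieces.append(new_piece)

# Serialize the merged bilingual tokenizer for use in continual pretraining.
with open("bilingual_tokenizer.model", "wb") as f:
    f.write(base_proto.SerializeToString())
print(f"Merged vocabulary size: {len(base_proto.pieces)}")
```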
Benefits of Localized Multilingual LLMs
Localized multilingual LLMs can bridge the language gap and unlock the potential of AI for a broader audience, enabling question answering, code generation, complex reasoning, and tool use in users' native languages.
Table: Comparison of LLMs

| Model | Parameters | Context Window (tokens) | Multilingual Support |
|---|---|---|---|
| Llama 3 | 405B | 128K | Yes |
| GPT-4 | 340B | 8K | Limited |
| Mistral | 7B | 8K | Yes |
| Gemma | 9B | 8K | Yes |
Table: Performance Metrics

| Model | MMLU (5-shot) | MMLU (0-shot, CoT) |
|---|---|---|
| Llama 3 | 87.3 | 88.6 |
| GPT-4 | 89.1 | 88.7 |
| Mistral | 61.1 | 60.5 |
| Gemma | 72.3 | 72.3 |
Table: Language Support

| Model | Languages Supported |
|---|---|
| Llama 3 | Over 30 languages |
| GPT-4 | Limited |
| Mistral | Multiple languages |
| Gemma | Multiple languages |
Table: Training Data

| Model | Training Data |
|---|---|
| Llama 3 | 15.6T tokens |
| GPT-4 | Not specified |
| Mistral | Not specified |
| Gemma | Not specified |
Table: Continual Pretraining

| Model | Continual Pretraining |
|---|---|
| Llama 3 | Yes |
| GPT-4 | Not specified |
| Mistral | Not specified |
| Gemma | Not specified |
Conclusion
Training localized multilingual LLMs with NVIDIA NeMo is a crucial step towards making AI systems more inclusive and effective in diverse linguistic contexts. By following the steps outlined in this article, developers can create customized tokenizers and perform continual pretraining to enhance the performance of LLMs in low-resource languages. This approach not only improves the accuracy and robustness of LLMs but also opens up new possibilities for AI applications in various cultural and linguistic contexts.