Summary

Training localized multilingual large language models (LLMs) is crucial for AI systems to understand and communicate in diverse languages. NVIDIA NeMo provides a comprehensive platform for developing custom generative AI, including tools for training, retrieval-augmented generation, guardrailing, and data curation. This article explores the best practices for adding new language support to base LLMs using NeMo, focusing on training and merging a multilingual tokenizer and performing continual pretraining.

Building Localized Multilingual LLMs with NVIDIA NeMo

In today’s globalized world, the ability of AI systems to understand and communicate in diverse languages is increasingly important. Large language models (LLMs) have revolutionized the field of natural language processing, enabling AI to generate human-like text, answer questions, and perform various language tasks. However, most mainstream LLMs are trained on data corpora that primarily consist of English, limiting their applicability to other languages and cultural contexts.

Challenges in Multilingual LLMs

Current state-of-the-art LLMs often struggle with Southeast Asian (SEA) languages due to limited training data and the unique linguistic characteristics of these languages. This results in lower performance compared to high-resource languages like English. While some LLMs can handle certain SEA languages to an extent, they still exhibit inconsistencies, hallucinations, and safety issues.

NVIDIA NeMo: A Solution for Localized Multilingual LLMs

NVIDIA NeMo is an end-to-end platform for developing custom generative AI, anywhere. It includes tools for training, retrieval-augmented generation (RAG), guardrailing, and data curation, offering enterprises an easy, cost-effective, and fast way to adopt generative AI.

Workflow for Adding New Language Support

To construct a multilingual LLM, you have two main options:

  • Use a multilingual dataset to pretrain the LLM from scratch.
  • Adopt continual pretraining on English foundation models using a dataset of the target language.

Pretraining from scratch requires vast amounts of target-language data and compute, so continual pretraining of an English foundation model is usually the more practical path. Its main prerequisite is the tokenizer: without a sufficiently expressive tokenizer, the model struggles to represent the low-resource language efficiently, leading to suboptimal performance. Building a customized bilingual tokenizer enables the model to process and learn from the low-resource language data more effectively during continual pretraining.
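To see the problem concretely, compare how an English-centric tokenizer fragments Thai versus English text. The following is a minimal, illustrative sketch that uses the Hugging Face transformers library with the public gpt2 tokenizer as a stand-in for the base model's tokenizer; the sample sentences are arbitrary.

    # Minimal sketch: measure how an English-centric tokenizer fragments Thai text.
    # The public "gpt2" tokenizer is only a stand-in for the base model's tokenizer.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    samples = {
        "English": "Hello, how are you today?",
        "Thai": "สวัสดีครับ วันนี้เป็นอย่างไรบ้าง",  # roughly the same greeting in Thai
    }

    for label, text in samples.items():
        ids = tokenizer.encode(text)
        # Tokens per character is a rough proxy for how efficiently the
        # tokenizer represents the language: higher means more fragmentation.
        print(f"{label}: {len(ids)} tokens for {len(text)} characters "
              f"({len(ids) / len(text):.2f} tokens/char)")

An English-trained byte-level BPE tokenizer typically breaks each Thai character into several byte-level tokens, which wastes context length and slows learning during continual pretraining.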

Steps for Training Localized Multilingual LLMs

  1. Download and extract the GPT model to obtain model weights and the model tokenizer.
  2. Customize the tokenizer training and merge to output a bilingual tokenizer (a sketch of this step follows the list).
  3. Modify the GPT model architecture to accommodate the bilingual tokenizer.
  4. Perform continual pretraining using the target language dataset.
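The heart of the workflow is step 2. As a minimal sketch, assuming the base tokenizer.model has already been extracted from the downloaded .nemo checkpoint (step 1) and a cleaned Thai corpus is available at thai_corpus.txt, a Thai SentencePiece tokenizer can be trained and its vocabulary merged into the base tokenizer roughly as follows. The file names and vocabulary size are illustrative assumptions.

    # Sketch of step 2: train a Thai SentencePiece tokenizer and merge its pieces
    # into the base model's tokenizer. Paths and vocab_size are illustrative.
    import sentencepiece as spm
    from sentencepiece import sentencepiece_model_pb2 as sp_model

    # 1) Train a monolingual tokenizer on the cleaned Thai corpus.
    spm.SentencePieceTrainer.train(
        input="thai_corpus.txt",        # assumed output of the data-curation step
        model_prefix="thai_tokenizer",  # writes thai_tokenizer.model / .vocab
        vocab_size=8000,                # illustrative size for the new language
        character_coverage=0.9995,
        model_type="bpe",
    )

    # 2) Load both tokenizer protos: the base (English) one and the new Thai one.
    base = sp_model.ModelProto()
    base.ParseFromString(open("base_tokenizer.model", "rb").read())
    thai = sp_model.ModelProto()
    thai.ParseFromString(open("thai_tokenizer.model", "rb").read())

    # 3) Append Thai pieces that the base tokenizer does not already contain.
    existing = {p.piece for p in base.pieces}
    for piece in thai.pieces:
        if piece.piece not in existing:
            new_piece = sp_model.ModelProto.SentencePiece()
            new_piece.piece = piece.piece
            new_piece.score = piece.score
            base.pieces.append(new_piece)

    # 4) Save the merged bilingual tokenizer.
    with open("bilingual_tokenizer.model", "wb") as f:
        f.write(base.SerializeToString())
    print(f"Merged tokenizer has {len(base.pieces)} pieces")

Because the merge keeps every original piece at its original index, the pretrained embedding rows can be reused unchanged in step 3.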

Data Collection and Cleaning

For this tutorial, use the NVIDIA NeMo Curator repository on GitHub to download and curate high-quality target language data. NVIDIA NeMo Curator consists of a collection of scalable data-mining modules for curating NLP data for training LLMs. The modules within NeMo Curator enable NLP researchers to mine high-quality text at scale from massive uncurated web corpora.

Curation Pipeline

  1. Separate languages to filter non-target content.
  2. Reformat documents to fix broken or improperly decoded Unicode.
  3. Perform document-level exact deduplication and fuzzy deduplication to remove duplicated data points.
  4. Perform document-level heuristic filtering to remove low-quality documents.
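NeMo Curator provides scalable modules for each of these stages. Purely to illustrate the logic on a small scale, the sketch below strings together language separation, Unicode reformatting, exact deduplication, and a simple length filter using common Python libraries (fasttext for language identification, ftfy for Unicode repair, hashlib for hashing). It is a stand-in for the pipeline's behavior, not the NeMo Curator API, and it omits fuzzy deduplication for brevity; the file names and the lid.176.bin language-ID model are assumptions.

    # Illustrative stand-in for the curation pipeline (not the NeMo Curator API).
    # Assumes documents.txt holds one document per line and that the fasttext
    # language-ID model lid.176.bin has been downloaded separately.
    import hashlib

    import fasttext            # language identification
    from ftfy import fix_text  # repairs mojibake / broken Unicode

    lang_model = fasttext.load_model("lid.176.bin")

    def is_target_language(text: str, target: str = "th") -> bool:
        labels, _ = lang_model.predict(text.replace("\n", " "), k=1)
        return labels[0] == f"__label__{target}"  # fasttext labels look like "__label__th"

    seen_hashes = set()
    kept = []
    with open("documents.txt", encoding="utf-8") as f:
        for line in f:
            doc = fix_text(line.strip())                 # step 2: Unicode reformatting
            if not doc or not is_target_language(doc):   # step 1: language separation
                continue
            digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
            if digest in seen_hashes:                    # step 3: exact deduplication
                continue
            seen_hashes.add(digest)
            if len(doc) >= 200:                          # step 4: a simple length heuristic
                kept.append(doc)

    with open("thai_corpus.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(kept))

In practice, NeMo Curator performs these steps at corpus scale and adds fuzzy deduplication and richer heuristic filters.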

Example: Training a Localized Multilingual LLM with Thai Wikipedia Data

In this example, we use Thai Wikipedia data to continually pretrain a GPT-1.3B model. We focus on training and merging a multilingual tokenizer and then discuss adopting the customized tokenizer in NeMo models and performing continual pretraining.
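Adopting the merged tokenizer means the model's input embedding table (and tied output layer) must grow to cover the new vocabulary while the pretrained rows stay untouched. NeMo handles this when the model configuration points at the new tokenizer, but the underlying idea can be sketched in plain PyTorch; the sizes and initialization scheme below are illustrative assumptions, not the exact NeMo implementation.

    # Framework-agnostic sketch of step 3: expand a pretrained embedding matrix
    # to cover the merged bilingual vocabulary. Pretrained rows are preserved;
    # new rows are drawn from the statistics of the existing embeddings.
    import torch

    old_vocab_size, hidden_size = 50_000, 2048   # illustrative base shapes
    new_vocab_size = 57_344                      # after merging the Thai pieces

    old_embeddings = torch.randn(old_vocab_size, hidden_size)  # stands in for checkpoint weights

    new_embeddings = torch.empty(new_vocab_size, hidden_size)
    new_embeddings[:old_vocab_size] = old_embeddings            # keep pretrained rows intact

    # Initialize the added rows from the mean/std of the existing embeddings so
    # the new Thai tokens start in a reasonable region of the embedding space.
    mean, std = old_embeddings.mean(dim=0), old_embeddings.std(dim=0)
    new_embeddings[old_vocab_size:] = mean + std * torch.randn(
        new_vocab_size - old_vocab_size, hidden_size
    )

    print(new_embeddings.shape)  # torch.Size([57344, 2048])

With the vocabulary expanded, continual pretraining proceeds with NeMo's standard Megatron GPT pretraining recipe, pointed at the curated Thai dataset and the bilingual tokenizer.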

Benefits of Localized Multilingual LLMs

Localized multilingual LLMs can bridge the language gap and unlock the potential of AI for a broader audience. They can answer questions in multiple languages, write high-quality code, solve complex reasoning problems, and use tools out of the box or in a zero-shot manner.

Table: Comparison of LLMs

Model   | Parameters | Context Window | Multilingual Support
Llama 3 | 405B       | 128K           | Yes
GPT-4   | 340B       | 8K             | Limited
Mistral | 7B         | 8K             | Yes
Gemma   | 9B         | 8K             | Yes

Table: Performance Metrics

Model   | MMLU (5-shot) | MMLU (0-shot, CoT)
Llama 3 | 87.3          | 88.6
GPT-4   | 89.1          | 88.7
Mistral | 61.1          | 60.5
Gemma   | 72.3          | 72.3

Table: Language Support

Model   | Languages Supported
Llama 3 | Over 30 languages
GPT-4   | Limited
Mistral | Multiple languages
Gemma   | Multiple languages

Table: Training Data

Model   | Training Data
Llama 3 | 15.6T tokens
GPT-4   | Not specified
Mistral | Not specified
Gemma   | Not specified

Table: Continual Pretraining

Model   | Continual Pretraining
Llama 3 | Yes
GPT-4   | Not specified
Mistral | Not specified
Gemma   | Not specified

Conclusion

Training localized multilingual LLMs with NVIDIA NeMo is a crucial step towards making AI systems more inclusive and effective in diverse linguistic contexts. By following the steps outlined in this article, developers can create customized tokenizers and perform continual pretraining to enhance the performance of LLMs in low-resource languages. This approach not only improves the accuracy and robustness of LLMs but also opens up new possibilities for AI applications in various cultural and linguistic contexts.