Summary

In today’s globalized world, the ability of AI systems to understand and communicate in diverse languages is increasingly crucial. Large language models (LLMs) have revolutionized natural language processing, but most mainstream LLMs are trained on corpora consisting primarily of English text, which limits their applicability to other languages and cultural contexts. This article explores how to train localized multilingual LLMs using NVIDIA NeMo, focusing on adding new language support to base LLMs.

Training Localized Multilingual LLMs: Bridging the Language Gap

The rise of large language models (LLMs) has significantly advanced natural language processing, enabling AI to generate human-like text, answer questions, and perform a wide range of language tasks. However, these models often struggle with languages other than English because of limited training data and unique linguistic characteristics. This limitation is particularly evident in Southeast Asian (SEA) languages, where current state-of-the-art LLMs perform markedly worse than they do on high-resource languages such as English.

The Need for Localized Multilingual LLMs

The demand for AI solutions that can understand and communicate in diverse languages is growing rapidly. In regions like Southeast Asia, developing localized multilingual LLMs is a strategic necessity. For instance, Singapore has launched the National Multimodal Large Language Model Programme (NMLP), a S$70M initiative that aims to build Southeast Asia’s first regional LLM, one focused on understanding the region’s unique linguistic and cultural nuances.

Challenges in Training Multilingual LLMs

One significant challenge in training multilingual LLMs is the scarcity of pretrained foundation models that understand the target language. There are two primary options for constructing a multilingual LLM:

  1. Pretraining from Scratch: Use a multilingual dataset to pretrain the LLM from scratch. This approach requires a large amount of data in the low-resource language, which is often not available.
  2. Continual Pretraining: Adopt continual pretraining on an English foundation model using a dataset in the target language. This method is more feasible for low-resource languages because it leverages transfer learning from the high-resource language on which the model was originally trained (a minimal sketch of the principle follows this list).
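In NeMo itself, continual pretraining is driven by the framework’s Megatron GPT training scripts and configuration files. Purely to illustrate the principle behind option 2, resuming causal language model training from pretrained English weights on target-language text, here is a minimal sketch using the Hugging Face transformers library as a stand-in; the model name gpt2 and the corpus path thai_corpus.txt are placeholder assumptions:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Start from a pretrained English foundation model ("gpt2" is a placeholder)
# rather than from random weights, so general language ability transfers.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Target-language corpus, one document per line (path is a placeholder).
corpus = load_dataset("text", data_files={"train": "thai_corpus.txt"})["train"]
corpus = corpus.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-thai",
                           num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=corpus,
    # mlm=False selects the standard next-token (causal LM) objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The key point is the starting checkpoint: the model begins from pretrained English weights rather than random initialization, so most of its general language ability carries over to the new language.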

Customizing Tokenizers for Low-Resource Languages

Most foundation models adopt byte-pair encoding (BPE) tokenizers trained largely on English text, so they do not adequately cover the unique characters, subwords, and morphology of low-resource languages. To address this, build a customized tokenizer that enables the model to process and learn from the low-resource language data more effectively during continual pretraining.
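As a concrete sketch, a monolingual tokenizer can be trained with the SentencePiece library; the corpus path and vocabulary size below are placeholder assumptions to be tuned to the actual dataset:

```python
import sentencepiece as spm

# Train a monolingual BPE tokenizer on cleaned target-language text,
# one document per line. The path and vocab_size are placeholders.
spm.SentencePieceTrainer.train(
    input="thai_corpus.txt",
    model_prefix="thai_bpe",    # writes thai_bpe.model and thai_bpe.vocab
    vocab_size=8000,
    model_type="bpe",
    character_coverage=0.9995,  # keep rare characters of a non-Latin script
)

# Sanity check: load the trained model and tokenize a sample sentence.
sp = spm.SentencePieceProcessor(model_file="thai_bpe.model")
print(sp.encode("สวัสดีครับ", out_type=str))
```

A high character_coverage matters for non-Latin scripts such as Thai, where dropping rare characters would map parts of the text to the unknown token.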

Workflow for Adding New Language Support

To add new language support to base LLMs using NVIDIA NeMo, the following workflow is proposed:

  1. Download and Extract the GPT Model: Obtain the pretrained model weights and tokenizer.
  2. Customize the Tokenizer Training and Merge: Train a monolingual tokenizer and merge it with the original English tokenizer to form a bilingual tokenizer.
  3. Modify the GPT Model Architecture: Extend the model’s embedding layers to accommodate the bilingual tokenizer’s larger vocabulary (see the sketch after this list).
  4. Perform Continual Pretraining: Use the target language dataset for continual pretraining.
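Steps 2 and 3 can be illustrated with the Hugging Face transformers library as a simplified stand-in for NeMo’s own merge and resize utilities; this is a sketch of the principle, not the NeMo implementation, and gpt2 and thai_bpe.model are placeholders:

```python
import sentencepiece as spm
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the original English model and its tokenizer ("gpt2" is a placeholder).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Collect the subword pieces learned by the monolingual tokenizer
# (thai_bpe.model is the output of the earlier SentencePiece sketch).
sp = spm.SentencePieceProcessor(model_file="thai_bpe.model")
thai_pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]

# Merge: add only the pieces the English vocabulary does not already contain.
new_tokens = [p for p in thai_pieces if p not in tokenizer.get_vocab()]
num_added = tokenizer.add_tokens(new_tokens)
print(f"added {num_added} tokens; vocabulary size is now {len(tokenizer)}")

# Modify the architecture: grow the input and output embedding matrices so
# every new token ID has a row. The new rows start randomly initialized and
# are learned during continual pretraining.
model.resize_token_embeddings(len(tokenizer))
```

One detail the sketch glosses over: the two tokenizers encode word boundaries differently (SentencePiece uses a ▁ marker, GPT-2 uses byte-level BPE), so a real merge has to reconcile the piece formats.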

Example with Thai Wikipedia Data

This tutorial uses Thai Wikipedia data as an example input for the workflow. The steps include:

  1. Data Collection and Cleaning: Use NVIDIA NeMo Curator to download and curate high-quality Thai Wikipedia data (a simplified stand-in sketch follows this list).
  2. Tokenizer Training: Train a monolingual tokenizer and merge it with the original English tokenizer.
  3. Continual Pretraining: Perform continual pretraining using the Thai Wikipedia data.
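The production pipeline for step 1 uses NVIDIA NeMo Curator; as a simplified stand-in that shows the same idea, download, filter, deduplicate, and export, here is a sketch using the Hugging Face datasets library, where the dataset name, snapshot, and thresholds are assumptions:

```python
from datasets import load_dataset

# Download a Thai Wikipedia dump. The dataset name and snapshot are
# assumptions; the tutorial itself performs this step with NeMo Curator.
wiki = load_dataset("wikimedia/wikipedia", "20231101.th", split="train")

# Basic quality filter: drop very short articles (threshold is arbitrary).
wiki = wiki.filter(lambda ex: len(ex["text"]) >= 200)

# Naive exact deduplication by article text. NeMo Curator also supports
# fuzzy deduplication, which this sketch does not attempt.
seen = set()
def first_occurrence(ex):
    key = hash(ex["text"])
    if key in seen:
        return False
    seen.add(key)
    return True

wiki = wiki.filter(first_occurrence)

# Export one document per line for tokenizer training and pretraining.
with open("thai_corpus.txt", "w", encoding="utf-8") as f:
    for article in wiki:
        f.write(article["text"].replace("\n", " ") + "\n")
```

The exported thai_corpus.txt is the placeholder file consumed by the tokenizer-training and continual-pretraining sketches above.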

NVIDIA NeMo Framework

NVIDIA NeMo is an end-to-end platform for developing custom generative AI. It includes tools for training and retrieval-augmented generation (RAG), guardrailing toolkits, data curation tools, and pretrained models, providing an easy, cost-effective, and fast way to adopt generative AI.

Benefits of Localized Multilingual LLMs

Localized multilingual LLMs can help businesses and organizations better serve their customers, automate processes, and create more engaging content that resonates with the region’s diverse population. These models can bridge the language gap and unlock the potential of AI for a broader audience.

Conclusion

Training localized multilingual LLMs using NVIDIA NeMo is a crucial step in bridging the language gap and unlocking the potential of AI for diverse languages and cultural contexts. By customizing tokenizers for low-resource languages and adopting continual pretraining, it is possible to develop LLMs that can understand and communicate effectively in languages other than English. The NVIDIA NeMo framework provides a comprehensive platform for developing custom generative AI, making it easier to adopt and deploy localized multilingual LLMs.