Summary
The development of large language models (LLMs) has revolutionized natural language processing (NLP), enabling AI to create content that traditional machine learning methods cannot. However, many existing models are predominantly trained on English data, leading to deficiencies in other languages, including Japanese. To address this, the Generative AI Accelerator Challenge (GENIAC) project used NVIDIA Megatron-LM to train a 172 billion parameter LLM with strong Japanese capabilities. This article explores the project’s objectives, training process, and results, highlighting the importance of efficient training frameworks like Megatron-LM in accelerating generative AI research and development.
Building a 172 Billion Parameter LLM for Japanese Language Processing
Generative AI has the ability to create entirely new content that traditional machine learning (ML) methods struggle to produce. In the field of natural language processing (NLP), the advent of large language models (LLMs) specifically has led to many innovative and creative AI use cases. These include customer support chatbots, voice assistants, text summarization, and translation—tasks previously handled by humans.
However, many models that currently top LLM leaderboards show insufficient understanding and performance in non-English languages, including Japanese. One reason is that their training corpora contain a high proportion of English data. For example, only 0.11% of the GPT-3 corpus is Japanese. Creating LLMs that perform well in Japanese, a language with far less available training data than English, has been immensely challenging.
GENIAC and the LLM-jp Project
The Ministry of Economy, Trade and Industry (METI) launched GENIAC to raise the level of foundation model development capability in Japan and to encourage creativity among companies and other organizations. GENIAC has provided computational resources, supported matching between companies and data holders, fostered collaboration with global technology companies, held community events, and evaluated the performance of the developed foundation models.
The LLM-jp project, which aims to develop a completely open 172-billion-parameter model (available on Hugging Face) with strong Japanese language capabilities, was selected for the GENIAC initiative. At the time of its development (February to August 2024), LLM-jp 172B was the largest model being developed in Japan, making it especially valuable to share the knowledge gained from the effort.
Training the Model Using NVIDIA Megatron-LM
Megatron-LM serves as a lightweight research-oriented framework leveraging Megatron-Core for training LLMs at unparalleled speed. Megatron-Core, the main component, is an open-source library that contains GPU-optimized techniques and cutting-edge system-level optimizations essential for large-scale training.
Megatron-Core supports various advanced model parallelism techniques, including tensor, sequence, pipeline, context, and MoE expert parallelism. This library offers customizable building blocks, training resiliency features such as fast distributed checkpointing, and many other innovations such as Mamba-based hybrid model training. It’s compatible with all NVIDIA Tensor Core GPUs, and includes support for Transformer Engine (TE) with FP8 precision introduced with NVIDIA Hopper architecture.
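As a rough illustration of how these parallelism dimensions come together, the following sketch initializes Megatron-Core's model-parallel process groups. The degrees of parallelism shown (8-way tensor, 4-way pipeline) are illustrative assumptions, not the configuration used for LLM-jp 172B.

```python
# Minimal sketch: initializing Megatron-Core model-parallel process groups.
# The parallel sizes below are illustrative, not the LLM-jp 172B configuration.
import torch
from megatron.core import parallel_state

def init_parallelism():
    # Megatron-Core expects torch.distributed to be initialized first
    # (for example, when launched with torchrun so RANK/WORLD_SIZE are set).
    torch.distributed.init_process_group(backend="nccl")

    parallel_state.initialize_model_parallel(
        tensor_model_parallel_size=8,    # shard each layer's weights across 8 GPUs
        pipeline_model_parallel_size=4,  # split the layer stack into 4 pipeline stages
    )

if __name__ == "__main__":
    init_parallelism()
```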
Model Architecture and Training Settings
| Parameter | Value |
| --- | --- |
| Hidden size | 12288 |
| FFN intermediate size | 38464 |
| Number of layers | 96 |
| Number of attention heads | 96 |
| Number of query groups | 16 |
| Activation function | SwiGLU |
| Position embedding | RoPE |
| Normalization | RMSNorm |
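As a quick sanity check, the settings in the table above do add up to roughly 172 billion parameters. The sketch below works through the arithmetic; the vocabulary size is an assumed value for illustration, since the tokenizer size is not given in this article.

```python
# Rough parameter count for the architecture in the table above.
# The vocabulary size is an assumption for illustration (not stated in this article).
hidden = 12288
ffn = 38464
layers = 96
heads = 96
query_groups = 16
vocab = 99_000  # assumed

head_dim = hidden // heads            # 128
kv_dim = query_groups * head_dim      # 2048 (grouped-query attention)

attn = hidden * hidden + 2 * hidden * kv_dim + hidden * hidden  # Q, K/V, output proj
ffn_params = 3 * hidden * ffn         # SwiGLU: gate, up, and down projections
per_layer = attn + ffn_params

embeddings = 2 * vocab * hidden       # input + output embeddings (assumed untied)
total = layers * per_layer + embeddings
print(f"{total / 1e9:.1f}B parameters")  # ~172.4B (norm weights and biases omitted as negligible)
```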
The LLM-jp 172B model is being trained from scratch on 2.1 trillion tokens of a multilingual corpus developed for the project, consisting mainly of Japanese and English. Training is performed on NVIDIA H100 Tensor Core GPUs on Google Cloud A3 instances, with FP8 hybrid training using the Transformer Engine. Megatron-Core v0.6 and Transformer Engine v1.4 are used in the experiment.
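For reference, the snippet below is a minimal sketch of how FP8 hybrid precision is enabled through Transformer Engine in plain PyTorch; it is not the actual LLM-jp training code, since Megatron-LM configures Transformer Engine internally. In the hybrid format, E4M3 is used in the forward pass and E5M2 is used for gradients in the backward pass.

```python
# Minimal sketch of FP8 hybrid training with Transformer Engine (not the actual
# LLM-jp/Megatron-LM training code, which wires this up internally).
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# HYBRID recipe: E4M3 in the forward pass, E5M2 for gradients in the backward pass.
# The amax settings below are illustrative values.
fp8_recipe = DelayedScaling(
    fp8_format=Format.HYBRID,
    amax_history_len=1024,
    amax_compute_algo="max",
)

layer = te.Linear(12288, 12288, params_dtype=torch.bfloat16).cuda()
x = torch.randn(16, 12288, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
y.sum().backward()
```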
Hyperparameters Used for the Model Training
| Parameter | Value |
| --- | --- |
| LR | 1E-4 |
| Minimum LR | 1E-5 |
| LR warmup iterations | 2000 |
| Weight decay | 0.1 |
| Gradient clipping | 1.0 |
| Global batch size | 1728 |
| Context length | 4096 |
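To make these values concrete, the sketch below shows one common way they fit together: a linear warmup to the peak learning rate over the first 2,000 iterations, followed by a decay toward the minimum learning rate. The cosine decay shape and the total iteration count are assumptions for illustration; the total is estimated from the 2.1 trillion token corpus divided by the roughly 7M tokens processed per step (1728 x 4096).

```python
import math

# Learning-rate schedule sketch based on the hyperparameters above.
# The cosine decay shape and the total iteration count are assumptions.
PEAK_LR = 1e-4
MIN_LR = 1e-5
WARMUP_ITERS = 2000
TOTAL_ITERS = 300_000  # ~2.1T tokens / (1728 * 4096 tokens per step)

def learning_rate(it: int) -> float:
    if it < WARMUP_ITERS:
        # Linear warmup from 0 to the peak learning rate.
        return PEAK_LR * (it + 1) / WARMUP_ITERS
    # Cosine decay from the peak learning rate down to the minimum.
    progress = min((it - WARMUP_ITERS) / (TOTAL_ITERS - WARMUP_ITERS), 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

print(learning_rate(0), learning_rate(1999), learning_rate(TOTAL_ITERS))
```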
In addition, the z-loss and batch-skipping techniques used in PaLM are incorporated to stabilize training, and FlashAttention is used to further accelerate it.
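For reference, here is a minimal sketch of the z-loss auxiliary term described in the PaLM paper: it penalizes the squared log of the softmax normalizer Z, which keeps logit magnitudes from drifting during long training runs. The 1e-4 coefficient is the value reported for PaLM and is assumed here; this is not the LLM-jp implementation.

```python
import torch
import torch.nn.functional as F

def cross_entropy_with_z_loss(logits: torch.Tensor,
                              targets: torch.Tensor,
                              z_loss_coeff: float = 1e-4) -> torch.Tensor:
    """Cross-entropy plus a PaLM-style z-loss auxiliary term.

    z-loss penalizes log(Z)^2, where Z is the softmax normalizer, keeping
    logit magnitudes bounded and stabilizing large-scale training.
    The 1e-4 coefficient is the value reported for PaLM (assumed here).
    """
    ce = F.cross_entropy(logits, targets)
    log_z = torch.logsumexp(logits, dim=-1)  # log of the softmax normalizer
    z_loss = z_loss_coeff * (log_z ** 2).mean()
    return ce + z_loss

# Toy usage: a batch of 4 examples over a 10-token vocabulary.
logits = torch.randn(4, 10, requires_grad=True)
targets = torch.randint(0, 10, (4,))
loss = cross_entropy_with_z_loss(logits, targets)
loss.backward()
```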
Training Throughput and Results
Pretraining for the latest LLM-jp 172B model is currently underway, with periodic evaluations every few thousand iterations to monitor training progress and confirm accuracy on Japanese and English downstream tasks. Notably, there is a sharp increase in TFLOP/s after approximately 7,000 iterations, corresponding to the transition from BF16 to FP8-hybrid precision.
The transition to FP8 hybrid training significantly increased training throughput, delivering a 1.4x speedup from 400 to 550 TFLOP/s. Training time is often a major bottleneck in pretraining LLMs, where vast datasets are required, so efficient training frameworks like Megatron-LM are crucial for accelerating generative AI research and development. For the 172B model, these results suggest that FP8-hybrid training is a promising approach for improving the efficiency of large-scale model pretraining.
Conclusion
The ongoing development of the LLM-jp 172B model not only aims to advance Japanese language capabilities but also sets a precedent for future multilingual AI models. The project underscores the importance of efficient training frameworks like Megatron-LM in accelerating generative AI research and development.