Unlocking the Potential of Sovereign AI: iGenius and NVIDIA DGX Cloud’s Breakthrough in LLM Development
Summary: In recent years, large language models (LLMs) have made significant strides in areas like reasoning, code generation, machine translation, and summarization. However, these foundation models often fall short in domain-specific expertise and in capturing cultural and linguistic nuances beyond English. To address these limitations, iGenius and NVIDIA DGX Cloud collaborated to develop Colosseum 355B, a state-of-the-art LLM tailored for regulated industries. This article examines the challenges and solutions in developing sovereign LLMs, highlighting the importance of continued pretraining, instruction fine-tuning, and retrieval-augmented generation.
The Challenge of Domain-Specific Expertise
Large language models have made extraordinary progress, but they remain limited in domain-specific fields such as finance and healthcare, and in capturing cultural and linguistic nuances beyond English. Overcoming these limitations requires further development using continued pretraining (CPT), instruction fine-tuning, and retrieval-augmented generation (RAG). This in turn demands high-quality, domain-specific datasets, a robust AI platform, and advanced AI expertise.
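To make the RAG technique mentioned above concrete, here is a minimal sketch of the retrieval step: rank domain documents against a query, then prepend the best matches to the prompt. The token-overlap scoring and all names are illustrative assumptions, not iGenius's implementation; a production system would use dense embeddings and a vector store.

```python
# Minimal RAG sketch: retrieve relevant domain documents, build an augmented
# prompt. Token-overlap scoring stands in for real embedding similarity.

def score(query: str, doc: str) -> float:
    """Fraction of query tokens that also appear in the document."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the top-k documents ranked by overlap with the query."""
    return sorted(corpus, key=lambda doc: score(query, doc), reverse=True)[:k]

def build_prompt(query: str, corpus: list[str]) -> str:
    """Assemble the augmented prompt: retrieved context first, then the query."""
    context = "\n".join(retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}"

corpus = [
    "Basel III sets minimum capital requirements for banks.",
    "The recipe calls for two eggs and a cup of flour.",
    "Liquidity coverage ratios are reported quarterly by banks.",
]
print(build_prompt("What capital requirements apply to banks?", corpus))
```

The payoff for regulated industries is that answers are grounded in vetted, up-to-date domain documents rather than only in the model's pretraining data.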
iGenius: Pioneering AI for Regulated Sectors
iGenius, an Italian technology company founded in 2016, specializes in AI solutions for enterprises in regulated sectors like financial services and public administration. With a mission to humanize data and democratize business knowledge, iGenius operates across Europe and the United States, delivering AI-driven insights to businesses and individuals alike.
Collaboration with NVIDIA DGX Cloud
To develop a cutting-edge foundational LLM within a tight timeline, iGenius collaborated with NVIDIA DGX Cloud. This partnership provided iGenius with access to large-scale GPU clusters and scalable training frameworks, enabling the creation of Colosseum 355B, a sovereign LLM tailored for highly regulated environments.
Continued Pretraining and Fine-Tuning
The development of Colosseum 355B involved continued pretraining followed by alignment using supervised fine-tuning (SFT) and Direct Preference Optimization (DPO). SFT refined the model's parameters on labeled input-output pairs, while DPO aligned the model with human preferences by training on pairs of preferred and rejected responses.
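The DPO objective can be sketched in a few lines. For each preference pair, it compares how much the policy's log-probabilities have shifted relative to a frozen reference model, and penalizes the model unless the preferred response gains more than the rejected one. The inputs and `beta` value below are illustrative.

```python
import math

# Sketch of the DPO loss for one (preferred, rejected) response pair.
# Inputs are summed log-probabilities of each response under the policy
# being trained and under a frozen reference model.

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# With no preference shift over the reference, the loss is exactly log(2);
# favoring the chosen response pushes it below log(2).
no_shift = dpo_loss(-10.0, -12.0, -10.0, -12.0)
print(round(no_shift, 4))  # 0.6931, i.e. log(2)
```

Unlike RLHF, this formulation needs no separate reward model, which simplifies large-scale alignment runs.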
Key Steps in Continued Pretraining
- Increasing Model Parameters: The model size was increased to 355B parameters.
- Increasing Context Length: The context length was extended from 4K to 16K.
- Achieving CPT in FP8: The model was trained in FP8 precision.
- Aligning Model Capabilities: The model was aligned for domain-specific expertise.
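The four steps above can be mirrored in a training recipe. The sketch below is purely illustrative: the field names are hypothetical and are not actual NVIDIA NeMo Framework configuration keys.

```python
# Hypothetical training-recipe sketch mirroring the CPT steps above.
# All field names are illustrative, not real NeMo configuration keys.

cpt_recipe = {
    "model": {
        "num_parameters": 355_000_000_000,  # step 1: scale to 355B parameters
        "seq_length": 16_384,               # step 2: context extended 4K -> 16K
        "precision": "fp8",                 # step 3: continued pretraining in FP8
    },
    "stages": [
        {"name": "continued_pretraining", "data": "domain_corpus"},
        {"name": "sft", "data": "instruction_pairs"},  # supervised fine-tuning
        {"name": "dpo", "data": "preference_pairs"},   # step 4: alignment
    ],
}

# Cheap sanity checks before committing thousands of GPUs to a long run.
assert cpt_recipe["model"]["seq_length"] == 4 * 4_096
assert [s["name"] for s in cpt_recipe["stages"]][0] == "continued_pretraining"
print("recipe ok")
```

Validating a recipe like this at small scale first is exactly the "explore configurations at a reduced scale" lesson discussed later in the article.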
The Role of NVIDIA DGX Cloud
NVIDIA DGX Cloud provided iGenius with a fully optimized solution designed, built, and validated by NVIDIA. This included access to large-scale GPU clusters, high-performance storage, and the latest NVIDIA NeMo Framework containers.
Key Features of NVIDIA DGX Cloud
- Dedicated Infrastructure: Access to over 3K NVIDIA H100 GPUs.
- High-Bandwidth Network: RDMA-based network for model training communications.
- High-Performance Storage: 500 TB of Lustre-based storage.
- NVIDIA AI Expertise: Access to NVIDIA AI expertise for accelerated training and issue resolution.
Challenges and Best Practices
Training at scale presents unique challenges, such as checkpoint loading delays and network instability. iGenius adopted best practices like progressive scaling, robust checkpointing, and effective monitoring to ensure maximum utilization of their infrastructure.
Key Lessons Learned
- Explore Configurations at a Reduced Scale.
- Monitor Performance Metrics like MFU.
- Maintain Accurate Experiment Tracking.
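The MFU metric in the lessons above can be sketched as a simple ratio: FLOPs the run actually sustains divided by the hardware's aggregate peak. The sketch uses the common ~6N FLOPs-per-token approximation for transformer training; the throughput and per-GPU peak figures below are hypothetical placeholders, not measured values from this project.

```python
# Model FLOPs utilization (MFU) sketch: achieved training FLOP/s over the
# cluster's peak FLOP/s. Uses the standard ~6 * N FLOPs-per-token estimate
# for a transformer with N parameters.

def mfu(n_params: float, tokens_per_second: float,
        n_gpus: int, peak_flops_per_gpu: float) -> float:
    achieved = 6.0 * n_params * tokens_per_second
    peak = n_gpus * peak_flops_per_gpu
    return achieved / peak

# Illustrative numbers only: a 355B-parameter model on 3,000 GPUs with an
# assumed 1e15 peak FLOP/s per GPU and a made-up token throughput.
print(f"MFU: {mfu(355e9, 5.6e5, 3000, 1e15):.2%}")
```

Tracking this ratio over time is what surfaces regressions like slow data loading or stragglers before they burn significant GPU hours.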
Table: Key Features of Colosseum 355B
| Feature | Description |
|---|---|
| Model Size | 355B parameters |
| Context Length | 16K |
| Precision | FP8 |
| Training Framework | NVIDIA NeMo Framework |
| Infrastructure | NVIDIA DGX Cloud with over 3K H100 GPUs |
| Storage | 500 TB of Lustre-based storage |
| Network | RDMA-based network |
Table: Performance Metrics
| Metric | Value |
|---|---|
| MFU | 40% (initial), 33% (final) |
| Accuracy | 82.04% in a 5-shot setting |
| Training Time | Completed in two months |
Conclusion
The collaboration between iGenius and NVIDIA DGX Cloud exemplifies how advanced AI infrastructure and expertise can accelerate the development of sovereign LLMs, paving the way for innovation in regulated sectors. By leveraging continued pretraining, fine-tuning, and robust AI platforms, organizations can overcome the limitations of foundation models and develop AI solutions that meet the specific needs of regulated industries.