Summary

Deep Learning Recommendation Models (DLRMs) are becoming increasingly important for driving user engagement on online platforms. However, training these models poses significant challenges due to their computational and memory requirements. This article explores how DLRM training and inference can be accelerated on NVIDIA GPUs, focusing on the NVIDIA GPU-accelerated DL model portfolio and tools such as NVIDIA HugeCTR and NVIDIA Merlin.

Speeding Up Deep Learning Recommendation Models with NVIDIA GPUs

Deep Learning Recommendation Models (DLRMs) are a critical component for driving user engagement on many online platforms. As industry datasets have grown rapidly in scale, DLRMs, which capitalize on large amounts of training data, have started to show promising results. However, training these models poses significant challenges due to their computational and memory requirements.

Challenges in Training DLRMs

Training DLRMs involves handling huge datasets, building complex data preprocessing and feature engineering pipelines, and running extensive, repeated experimentation. These challenges call for high-performance computing solutions that can meet the computational demands of large-scale DL recommender system training and inference.

NVIDIA GPU-Accelerated DL Model Portfolio

NVIDIA’s GPU-accelerated DL model portfolio covers a wide range of network architectures and applications in many different domains, including image, text, and speech analysis, and recommender systems. The Deep Learning Recommendation Model (DLRM) is part of this portfolio and is designed to tackle the challenges mentioned above.

Optimizing DLRM on NVIDIA GPUs

To optimize DLRM on NVIDIA GPUs, several strategies are employed:

  • Data Preprocessing: New Spark-on-GPU tools are introduced to speed up data preprocessing tasks on massive datasets.
  • Automatic Mixed Precision Training: Using NVIDIA Tensor Core GPUs, an optimized data loader, and a custom embedding CUDA kernel, a DLRM model can be trained on the Criteo Terabyte dataset in just 44 minutes, compared to 36.5 hours on 96 CPU threads.
  • Embedding Layers: Embedding layers are memory bandwidth-constrained. To use the available memory bandwidth efficiently, all categorical embedding tables are combined into a single table, and a custom kernel performs the embedding lookups (a simplified sketch of this idea, together with mixed precision training, follows this list).
  • Deployment with NVIDIA Triton Inference Server: Trained DLRM models can be deployed into production with the NVIDIA Triton Inference Server, which provides a cloud inferencing solution optimized for NVIDIA GPUs.
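
To make the embedding fusion and mixed precision ideas above more concrete, the following is a minimal plain-PyTorch sketch, not the repository's actual implementation (which relies on a custom CUDA kernel and an optimized data loader). The table sizes, embedding dimension, and MLP shape are made-up placeholders.

    import torch
    from torch import nn

    # Hypothetical cardinalities for three categorical features (the Criteo
    # dataset has 26; these sizes are placeholders).
    table_sizes = [1000, 500, 250]
    embedding_dim = 128

    # Fuse all categorical tables into one embedding matrix and keep
    # per-feature offsets so each feature indexes into its own slice.
    offsets = torch.zeros(len(table_sizes), dtype=torch.long)
    offsets[1:] = torch.cumsum(torch.tensor(table_sizes), dim=0)[:-1]
    fused_embedding = nn.Embedding(sum(table_sizes), embedding_dim).cuda()

    mlp = nn.Sequential(
        nn.Linear(len(table_sizes) * embedding_dim, 256), nn.ReLU(), nn.Linear(256, 1)
    ).cuda()
    params = list(fused_embedding.parameters()) + list(mlp.parameters())
    optimizer = torch.optim.SGD(params, lr=0.01)
    scaler = torch.cuda.amp.GradScaler()  # loss scaling for mixed precision

    def train_step(cat_indices, labels):
        # cat_indices: [batch, num_features] per-table indices, already on the GPU
        global_indices = cat_indices + offsets.to(cat_indices.device)  # one fused lookup
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():  # run the forward pass in mixed precision
            emb = fused_embedding(global_indices).flatten(start_dim=1)
            logits = mlp(emb).squeeze(1)
            loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        return loss.item()

    # Toy batch: 4 samples, 3 categorical features each, random binary labels.
    cats = torch.randint(0, 250, (4, 3), device="cuda")
    labels = torch.randint(0, 2, (4,), device="cuda").float()
    print(train_step(cats, labels))

In the actual implementation, fusing the lookups into a single kernel avoids launching one small, bandwidth-bound kernel per table.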

End-to-End Training Pipeline

An end-to-end training pipeline on the Criteo Terabyte dataset is provided to help users get started with just a few simple steps:

  1. Clone the Repository:

    git clone https://github.com/NVIDIA/DeepLearningExamples
    cd DeepLearningExamples/PyTorch/Recommendation/DLRM
    
  2. Build a DLRM Docker Container:

    docker build . -t nvidia_dlrm_pyt
    
  3. Start an Interactive Session:

    mkdir -p data
    docker run --runtime=nvidia -it --rm --ipc=host -v ${PWD}/data:/data nvidia_dlrm_pyt bash
    

Performance Improvements

  • Spark-GPU Plugin: The Spark-GPU plugin for DLRM can further speed up data preprocessing by a factor of up to 43x compared to an equivalent Spark-CPU pipeline.
  • Static Batch Throughput: NVIDIA Triton Inference Server delivers up to 20x higher throughput than an 80-thread CPU inference baseline.
  • Inference Latency: At a batch size of 8192, a V100 32-GB GPU reduces latency by 19x compared to the same 80-thread CPU baseline (a client-side sketch follows this list).
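
As a rough illustration of how a model deployed behind Triton is queried from Python, here is a minimal client sketch. The model name ("dlrm"), tensor names, and input shapes are assumptions for illustration only; the actual names and shapes depend on how the trained model was exported to the Triton model repository.

    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")
    batch_size = 8192

    # Hypothetical inputs: 13 dense features and 26 categorical indices per sample.
    dense = httpclient.InferInput("input__0", [batch_size, 13], "FP32")
    dense.set_data_from_numpy(np.random.rand(batch_size, 13).astype(np.float32))
    cats = httpclient.InferInput("input__1", [batch_size, 26], "INT64")
    cats.set_data_from_numpy(np.random.randint(0, 1000, (batch_size, 26)).astype(np.int64))

    # Send one static batch and read back the predictions.
    response = client.infer(model_name="dlrm", inputs=[dense, cats])
    print(response.as_numpy("output__0").shape)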

Best Practices for Building and Deploying Recommender Systems

  • Split Embedding Tables: Frequently accessed rows can stay in GPU memory, while less popular rows remain in CPU memory (a simplified sketch follows this list).
  • Fuse Embedding Tables: Combining tables where possible gives better efficiency in terms of GPU execution and scheduling.
  • NVIDIA HugeCTR: HugeCTR is a GPU-accelerated recommender framework designed for click-through rate (CTR) estimation and built to distribute training across multiple GPUs and nodes.
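
The split-table idea can be illustrated with a small plain-PyTorch sketch: hot rows live in an embedding on the GPU, cold rows stay in an embedding in CPU memory, and lookups are routed by a precomputed hot-row mapping. Class and variable names here are hypothetical, and production frameworks such as HugeCTR handle this caching far more efficiently.

    import torch
    from torch import nn

    class SplitEmbedding(nn.Module):
        """Keep frequently accessed ("hot") rows on the GPU and the remaining
        ("cold") rows in CPU memory. hot_ids is a hypothetical precomputed list
        of the most frequent categorical values (e.g. from training-data counts)."""

        def __init__(self, num_rows, dim, hot_ids):
            super().__init__()
            row_to_slot = torch.full((num_rows,), -1, dtype=torch.long)
            row_to_slot[hot_ids] = torch.arange(len(hot_ids))
            self.register_buffer("row_to_slot", row_to_slot)
            self.hot = nn.Embedding(len(hot_ids), dim).cuda()  # hot rows on the GPU
            self.cold = nn.Embedding(num_rows, dim)            # cold rows stay on the CPU

        def forward(self, ids):  # ids: [batch] of row indices on the GPU
            slots = self.row_to_slot.to(ids.device)[ids]
            is_hot = slots >= 0
            out = torch.empty(len(ids), self.hot.embedding_dim, device=ids.device)
            out[is_hot] = self.hot(slots[is_hot])
            # Cold rows are gathered in CPU memory and copied to the GPU on demand.
            out[~is_hot] = self.cold(ids[~is_hot].cpu()).to(ids.device)
            return out

    emb = SplitEmbedding(num_rows=100_000, dim=64, hot_ids=torch.arange(1_000))
    print(emb(torch.randint(0, 100_000, (8,), device="cuda")).shape)  # torch.Size([8, 64])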

NVIDIA Merlin

NVIDIA Merlin is an open-source application framework for building high-performance, DL-based recommender systems. It facilitates and accelerates recommender system development on GPUs, speeding up common ETL tasks, model training, and inference serving by roughly 10x over commonly used methods.
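
The ETL component of Merlin is NVTabular. To give a rough sense of what GPU-accelerated preprocessing looks like there, the following is a minimal sketch assuming NVTabular's column-selector API; the column names and file paths are placeholders, and details may differ between NVTabular versions.

    import nvtabular as nvt
    from nvtabular import ops

    # Declare GPU-accelerated preprocessing: categorify the categorical columns,
    # fill and normalize the continuous ones. Column names are placeholders.
    cat_features = ["category_a", "category_b"] >> ops.Categorify()
    cont_features = ["num_a", "num_b"] >> ops.FillMissing() >> ops.Normalize()

    workflow = nvt.Workflow(cat_features + cont_features)
    dataset = nvt.Dataset("raw_data/*.parquet")

    workflow.fit(dataset)  # compute statistics (categories, means, stds) on the GPU
    workflow.transform(dataset).to_parquet("preprocessed/")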

Conclusion

Optimizing DLRMs on NVIDIA GPUs can significantly accelerate training and inference, making these models more practical for large-scale applications. By leveraging tools such as NVIDIA HugeCTR and NVIDIA Merlin, and by following best practices for building and deploying recommender systems, developers can build high-performance, efficient DLRM solutions. This article has highlighted the main strategies and tools available for optimizing DLRMs on NVIDIA GPUs, providing a starting point for developers looking to improve their recommender systems.