Unlocking the Power of UMAP on GPUs: How RAPIDS cuML Revolutionizes Dimension Reduction

Summary

The Uniform Manifold Approximation and Projection (UMAP) algorithm is a widely used tool for dimension reduction in various fields such as bioinformatics, NLP topic modeling, and machine learning preprocessing. However, handling large datasets with traditional CPU-based UMAP can be time-consuming and inefficient. This article explores how RAPIDS cuML, with its latest enhancements, accelerates and scales UMAP on GPUs, making it faster and more scalable than ever before.

The Challenges of Traditional UMAP

When dealing with large datasets, UMAP has traditionally faced two significant challenges. First, constructing the all-neighbors graph dominates the runtime: the brute-force approach computes the distance between every pair of points, which scales poorly as datasets grow. Second, the entire dataset must fit into the memory of the GPU, which can be particularly challenging with consumer-level NVIDIA RTX GPUs that have limited memory. Even high-end GPUs like the NVIDIA H100, with 80 GB of memory, may not be sufficient for datasets approaching that size, because algorithms like UMAP also require numerous temporary memory allocations.

Accelerating and Scaling UMAP with RAPIDS cuML

RAPIDS cuML addresses these challenges with a novel batched approximate nearest neighbor (ANN) algorithm. This approach uses a GPU-accelerated version of the nearest neighbor descent (nn-descent) algorithm from the RAPIDS cuVS library, which is well suited to constructing all-neighbors graphs. ANN algorithms accelerate the graph-building process by trading a small amount of quality for speed, reducing the number of distances that must be computed to find the nearest neighbors.
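For contrast with the approximate approach, a brute-force all-neighbors graph computes every one of the n² pairwise distances. The following NumPy sketch is illustrative only (it is not cuML's implementation), but it shows the quadratic cost that ANN methods like nn-descent avoid:

```python
import numpy as np

def brute_force_knn_graph(data, k):
    """Exact k-nearest-neighbor graph: computes all n^2 pairwise distances.

    This is the cost that approximate methods such as nn-descent reduce.
    """
    # Squared Euclidean distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2
    sq = (data ** 2).sum(axis=1)
    dists = sq[:, None] - 2.0 * (data @ data.T) + sq[None, :]
    np.fill_diagonal(dists, np.inf)           # a point is not its own neighbor
    return np.argsort(dists, axis=1)[:, :k]   # indices of the k closest points

rng = np.random.default_rng(0)
X = rng.random((500, 8))
graph = brute_force_knn_graph(X, k=10)
print(graph.shape)  # (500, 10)
```

At tens of millions of points, the full n × n distance matrix is far beyond any single GPU's memory, which is why reducing both the number of distance computations and the peak memory footprint matters.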

The Power of Batched All-Neighbors Graph Construction

The batched approach allows for processing larger datasets by breaking down the computation into manageable pieces. This means that only the necessary parts of the dataset are loaded into GPU memory at any given time, overcoming the memory limitations of traditional UMAP. This method not only speeds up the process but also enables the handling of datasets that were previously too large to fit into GPU memory.
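The batching idea can be illustrated with a plain NumPy sketch that builds the neighbor graph one batch of query rows at a time, so only one distance block is ever resident. This is a simplification: cuML's batching is built on nn-descent and GPU memory management, not this exact scheme.

```python
import numpy as np

def batched_knn_graph(data, k, batch_size):
    """All-neighbors graph built in batches: only a (batch_size x n)
    distance block exists at any time, instead of the full n x n matrix."""
    n = data.shape[0]
    sq = (data ** 2).sum(axis=1)
    neighbors = np.empty((n, k), dtype=np.int64)
    for start in range(0, n, batch_size):
        stop = min(start + batch_size, n)
        # Distances from this batch of query points to every point in the dataset
        block = sq[start:stop, None] - 2.0 * (data[start:stop] @ data.T) + sq[None, :]
        block[np.arange(stop - start), np.arange(start, stop)] = np.inf  # drop self
        neighbors[start:stop] = np.argsort(block, axis=1)[:, :k]
    return neighbors

rng = np.random.default_rng(1)
X = rng.random((1000, 16))
graph = batched_knn_graph(X, k=8, batch_size=128)
print(graph.shape)  # (1000, 8)
```

Shrinking the batch size trades a longer sequence of smaller steps for a lower peak memory footprint, which is exactly the lever that lets out-of-memory workloads complete.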

Performance Improvements

The performance improvements achieved by RAPIDS cuML are significant. Benchmark results show that UMAP can now run on datasets that don't fit in device memory, such as a 50-million-point dataset with 768 dimensions, roughly 150 GB of data. The speedup grows with dataset size: one workload that previously took over 10 hours now completes in about 2 minutes.

Comparison of End-to-End Runtime

The following table illustrates the dramatic speedup achieved by using the nn-descent algorithm compared to the brute-force method:

| Matrix Size | Running UMAP with Brute-Force | Running UMAP with NN-Descent | Speedup |
|-------------|-------------------------------|------------------------------|---------|
| 1M x 960    | 214.4s                        | 9.9s                         | 21.6x   |
| 8M x 384    | 2191.3s                       | 34.0s                        | 64.4x   |
| 10M x 96    | 2170.8s                       | 53.4s                        | 40.6x   |
| 20M x 384   | 38350.7s                      | 122.9s                       | 312x    |
| 59M x 768   | Error: out of memory          | 575.1s                       | -       |

Getting Started with RAPIDS cuML

To start using RAPIDS cuML and experience the power of accelerated UMAP, follow the RAPIDS Installation Guide for conda and pip packages, as well as ready-to-go Docker containers. With these tools, you can unlock the full potential of UMAP on GPUs and transform your data analysis workflows.
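Once installed, switching UMAP to the batched nn-descent graph build comes down to a couple of constructor arguments. The snippet below is a hedged sketch: the parameter names (`build_algo`, `build_kwds`, `nnd_do_batch`, `nnd_n_clusters`, `data_on_host`) follow my reading of the cuML 24.10 API, so check the cuML UMAP reference for your installed version. The import is guarded so the example degrades gracefully on a machine without a GPU.

```python
import numpy as np

try:
    from cuml.manifold import UMAP  # requires RAPIDS cuML and an NVIDIA GPU
except ImportError:
    UMAP = None

if UMAP is not None:
    data = np.random.random((50_000, 32)).astype(np.float32)
    reducer = UMAP(
        n_neighbors=16,
        build_algo="nn_descent",        # approximate graph build instead of brute force
        build_kwds={
            "nnd_do_batch": True,       # enable batched graph construction
            "nnd_n_clusters": 4,        # more clusters -> smaller GPU memory footprint
        },
    )
    # data_on_host=True keeps the full dataset in host (CPU) memory and
    # streams batches to the GPU, so the dataset need not fit on the device.
    embedding = reducer.fit_transform(data, data_on_host=True)
    print(embedding.shape)
```

Raising `nnd_n_clusters` splits the graph build into more, smaller batches, which is the knob to reach for when a dataset exceeds GPU memory.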

Conclusion

The enhancements in RAPIDS cuML 24.10 have revolutionized the performance and scalability of UMAP on GPUs. By leveraging the nn-descent algorithm and batched all-neighbors graph construction, RAPIDS cuML enables data scientists to process large-scale datasets at unprecedented speeds. This breakthrough opens up new possibilities for exploratory data analysis and machine learning applications, making it an invaluable tool for the data science community.