Summary
NVIDIA has introduced an approach for horizontally autoscaling its NIM microservices on Kubernetes, using custom metrics to drive efficient resource allocation. The method relies on Kubernetes Horizontal Pod Autoscaling (HPA) to adjust replica counts dynamically based on metrics such as GPU cache usage, optimizing compute and memory consumption. This article walks through the details of that approach.
Scaling AI Inference Pipelines with NVIDIA NIM Microservices
NVIDIA NIM microservices are model inference containers deployable on Kubernetes, crucial for serving large-scale machine learning models. Autoscaling them efficiently in production requires a clear understanding of their compute and memory profiles.
Understanding NVIDIA NIM Microservices
NVIDIA NIM microservices serve as the backbone for AI inference pipelines, enabling the deployment of generative AI models across various platforms. These microservices are designed to be cloud-native, easy to use, and scalable, reducing the time-to-market for AI applications.
Setting Up Autoscaling
To set up autoscaling for NIM microservices, a Kubernetes cluster must be equipped with components such as the Kubernetes Metrics Server, Prometheus, Prometheus Adapter, and Grafana. These tools scrape and surface the metrics the HPA controller needs; a sketch of a corresponding Prometheus Adapter rule follows the table.
| Component | Function |
|---|---|
| Kubernetes Metrics Server | Collects resource metrics from kubelets and exposes them through the Kubernetes API server. |
| Prometheus | Scrapes metrics from pods and stores them as queryable time series. |
| Prometheus Adapter | Exposes Prometheus metrics through the custom metrics API so HPA can use them in scaling strategies. |
| Grafana | Visualizes custom metrics, facilitating the monitoring and adjustment of resource allocation. |
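To make a NIM metric such as gpu_cache_usage_perc available to HPA, the Prometheus Adapter needs a rule that maps the Prometheus series to a per-pod custom metric. The snippet below is a minimal sketch in the values.yaml format of the prometheus-adapter Helm chart; the series name and labels are assumptions that depend on how your cluster scrapes the NIM metrics.

```yaml
# Sketch of a Prometheus Adapter rule (Helm chart values.yaml) that exposes
# the gpu_cache_usage_perc series as a per-pod custom metric for HPA.
# Series and label names are assumptions; adjust to your scrape configuration.
rules:
  custom:
    - seriesQuery: 'gpu_cache_usage_perc{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "gpu_cache_usage_perc"
        as: "gpu_cache_usage_perc"
      metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```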
Deploying NIM Microservices
NVIDIA provides a detailed guide for deploying NIM microservices, using the NIM for LLMs microservice as its example. This involves setting up the necessary infrastructure and ensuring the microservice is ready to be scaled on GPU cache usage metrics.
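As part of that setup, Prometheus must scrape the NIM metrics endpoint. The manifest below is a minimal sketch assuming the Prometheus Operator is in use; the namespace, service labels, and port name are illustrative and must match how the NIM for LLMs service is actually deployed.

```yaml
# Sketch of a ServiceMonitor so Prometheus scrapes the NIM for LLMs
# metrics endpoint. Namespace, labels, and port name are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nim-llm-metrics
  namespace: nim
spec:
  selector:
    matchLabels:
      app: nim-llm            # label on the NIM for LLMs Service (assumed)
  endpoints:
    - port: http-openai       # named Service port that also serves /metrics (assumed)
      path: /metrics
      interval: 15s
```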
Implementing Horizontal Pod Autoscaling
To implement HPA, NVIDIA demonstrates creating an HPA resource targeting the gpu_cache_usage_perc metric. Under load tests at different concurrency levels, the HPA automatically adjusts the number of pods to maintain performance, showing how well it handles fluctuating workloads.
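A minimal sketch of such an HPA resource is shown below, assuming gpu_cache_usage_perc is already served by the custom metrics API; the deployment name, namespace, replica bounds, and target value are illustrative rather than NVIDIA's exact configuration.

```yaml
# Sketch of an autoscaling/v2 HPA that scales a NIM deployment on the
# average per-pod gpu_cache_usage_perc custom metric. Names, bounds, and
# the target value are illustrative assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nim-llm-hpa
  namespace: nim
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nim-llm              # NIM for LLMs Deployment (assumed name)
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_cache_usage_perc
        target:
          type: AverageValue
          averageValue: "0.5"  # assumes the metric is a 0-1 fraction; scale out above ~50% cache usage
```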
Future Prospects
NVIDIA’s approach opens avenues for further exploration, such as scaling based on multiple metrics like request latency or GPU compute utilization. Additionally, leveraging Prometheus Query Language (PromQL) to create new metrics can enhance the autoscaling capabilities.
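As an illustration, an autoscaling/v2 HPA can list several metrics and scale to the largest per-metric recommendation. The sketch below combines GPU cache usage with a hypothetical average-latency metric that would be derived from a PromQL recording rule or adapter rule; all names and targets are assumptions.

```yaml
# Sketch of an HPA evaluating two metrics; the controller computes a desired
# replica count for each and uses the largest. The latency metric name and
# both targets are hypothetical.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nim-llm-hpa-multi
  namespace: nim
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nim-llm
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_cache_usage_perc
        target:
          type: AverageValue
          averageValue: "0.5"
    - type: Pods
      pods:
        metric:
          name: request_latency_seconds_avg   # hypothetical metric exposed via a PromQL recording rule
        target:
          type: AverageValue
          averageValue: "2"
```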
Horizontal Pod Autoscaling in Kubernetes
Kubernetes Horizontal Pod Autoscaling (HPA) is a powerful feature that automatically scales the number of Pods in a Deployment, ReplicaSet, or StatefulSet based on observed CPU utilization or other select metrics.
How HPA Works
HPA is implemented as a Kubernetes API resource and a controller. The resource determines the behavior of the controller. The horizontal pod autoscaling controller, running within the Kubernetes control plane, periodically adjusts the desired scale of its target to match observed metrics.
| Step | Description |
|---|---|
| 1 | The controller manager queries resource utilization against the metrics specified in each HPA definition. |
| 2 | The controller manager finds the target resource defined by the scaleTargetRef. |
| 3 | The controller manager selects the pods based on the target resource's .spec.selector labels. |
| 4 | The controller manager obtains the metrics from either the resource metrics API or the custom metrics API. |
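At each evaluation interval, the controller then derives the desired replica count from the ratio of the observed metric to its target, per the documented Kubernetes algorithm: desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). When multiple metrics are configured, the largest of the per-metric recommendations is used.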
Benefits of HPA
HPA provides several benefits, including:
- Efficient Resource Allocation: HPA ensures that resources are allocated efficiently, reducing waste and improving performance.
- Scalability: HPA enables applications to scale dynamically, handling fluctuating workloads with ease.
- Flexibility: HPA supports custom metrics, allowing for fine-grained control over scaling strategies.
Conclusion
NVIDIA’s approach to horizontally autoscaling NIM microservices on Kubernetes is a significant step forward in optimizing AI inference pipelines. By leveraging custom metrics and HPA, NVIDIA has demonstrated a comprehensive method for efficient resource allocation and dynamic scaling. This approach not only improves performance but also reduces the complexity of managing AI inference pipelines, making it easier for developers to deploy and scale AI applications.