Summary:

Perplexity AI, an AI-powered search engine, handles over 435 million search queries each month. To meet this demand, Perplexity uses NVIDIA’s inference stack, including H100 Tensor Core GPUs, NVIDIA Triton Inference Server, and NVIDIA TensorRT-LLM. This setup allows Perplexity to serve multiple AI models simultaneously, optimize GPU utilization, and meet strict service-level agreements (SLAs). By leveraging NVIDIA’s technology, Perplexity has achieved significant cost savings and improved performance.

Scaling AI Search with Perplexity and NVIDIA

Perplexity AI is at the forefront of AI-powered search, handling over 435 million search queries each month. This massive workload requires a robust and scalable infrastructure to ensure an optimal user experience. To meet this challenge, Perplexity turned to NVIDIA’s inference stack, which delivers the high-throughput, low-latency inference this workload demands.

Serving Multiple AI Models Simultaneously

Perplexity’s inference team serves over 20 AI models simultaneously, including different variations of the popular open-source Llama 3.1 models like 8B, 70B, and 405B. To match each user request with the appropriate model, the company relies on smaller classifier models that help determine user intent. User tasks detected by the classifiers, like text completion, are then routed to specific models deployed on GPU pods.
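Perplexity hasn’t published the routing logic itself, but the idea can be sketched in a few lines of Python: a lightweight classifier predicts the intent of a request, and a lookup table maps that intent to a model pool. The classify_intent heuristic, the pool names, and the MODEL_ROUTES table below are illustrative assumptions, not Perplexity’s actual configuration.

```python
# Hypothetical sketch: route a request to a model pool based on a
# small classifier's predicted intent. Names and endpoints are
# illustrative, not Perplexity's actual configuration.
from dataclasses import dataclass

# Map of predicted intent -> model pool that serves it (illustrative).
MODEL_ROUTES = {
    "text_completion": "llama-3.1-8b-pool",
    "complex_reasoning": "llama-3.1-70b-pool",
    "research_synthesis": "llama-3.1-405b-pool",
}

@dataclass
class Request:
    user_id: str
    prompt: str

def classify_intent(prompt: str) -> str:
    """Stand-in for the small classifier model that predicts user intent."""
    # A real system would call a fine-tuned classifier here; this
    # keyword heuristic only keeps the example self-contained.
    if len(prompt.split()) > 200:
        return "research_synthesis"
    if any(kw in prompt.lower() for kw in ("why", "compare", "prove")):
        return "complex_reasoning"
    return "text_completion"

def route(request: Request) -> str:
    """Return the model pool that should handle this request."""
    intent = classify_intent(request.prompt)
    return MODEL_ROUTES.get(intent, "llama-3.1-8b-pool")

if __name__ == "__main__":
    print(route(Request("u1", "Why does batching improve GPU utilization?")))
```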

Optimizing GPU Utilization

Each GPU pod consists of one or more NVIDIA H100 GPUs and is managed by an NVIDIA Triton Inference Server instance. The pods operate under strict SLAs for both cost efficiency and user interactivity. To accommodate Perplexity’s large user base and fluctuating traffic throughout the day, the pods are hosted within a Kubernetes cluster. A front-end scheduler built in-house routes traffic to the appropriate pod based on each pod’s load and usage, ensuring that the SLAs are consistently met.
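The in-house scheduler’s internals aren’t described in the source; the sketch below shows one generic policy such a front-end scheduler could use, routing each request to the least-loaded pod serving the requested model. The pod names, the in-flight counters, and the FrontEndScheduler class are hypothetical.

```python
# Hypothetical sketch: pick the pod with the fewest in-flight requests
# among the pods serving the requested model. This is a generic
# least-loaded policy, not Perplexity's actual scheduler.
from collections import defaultdict

class FrontEndScheduler:
    def __init__(self, pods_by_model: dict[str, list[str]]):
        self.pods_by_model = pods_by_model
        self.in_flight = defaultdict(int)  # pod name -> outstanding requests

    def acquire(self, model: str) -> str:
        """Choose the least-loaded pod for `model` and mark one request in flight."""
        pods = self.pods_by_model[model]
        pod = min(pods, key=lambda p: self.in_flight[p])
        self.in_flight[pod] += 1
        return pod

    def release(self, pod: str) -> None:
        """Mark a request on `pod` as finished."""
        self.in_flight[pod] -= 1

if __name__ == "__main__":
    sched = FrontEndScheduler({"llama-3.1-70b": ["pod-a", "pod-b", "pod-c"]})
    chosen = sched.acquire("llama-3.1-70b")
    print(f"routing request to {chosen}")
    sched.release(chosen)
```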

Meeting Strict Service-Level Agreements

To define the right SLAs for the company’s diverse use cases, Perplexity’s inference team conducts comprehensive A/B testing, evaluating different configurations and their impact on user experience. Their goal is to maximize GPU utilization while consistently meeting the target SLA for each specific use case. Improving batching while still meeting the target SLAs lowers the cost of inference serving.
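As a simplified illustration of that trade-off, the sketch below picks the largest batch size whose measured p95 latency still fits under the SLA, since larger batches raise GPU utilization and lower cost per request. The latency figures in the example are placeholders, not measured Perplexity data.

```python
# Hypothetical sketch: given measured p95 latency per batch size
# (e.g. from A/B tests), choose the largest batch that still meets
# the SLA. Larger batches raise GPU utilization and lower cost per
# request, so we want the biggest batch the SLA allows.

def pick_batch_size(p95_latency_ms: dict[int, float], sla_ms: float) -> int:
    """Return the largest batch size whose p95 latency is within the SLA."""
    feasible = [b for b, lat in p95_latency_ms.items() if lat <= sla_ms]
    if not feasible:
        raise ValueError("No batch size meets the SLA; revisit the configuration.")
    return max(feasible)

if __name__ == "__main__":
    # Placeholder measurements, NOT real Perplexity numbers.
    measurements = {1: 120.0, 4: 180.0, 8: 260.0, 16: 410.0, 32: 690.0}
    print(pick_batch_size(measurements, sla_ms=300.0))  # -> 8
```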

Cost Savings and Performance Improvements

These optimizations translate directly into savings. For example, the team estimated that serving the models that power its Related-Questions feature on cloud-hosted NVIDIA GPUs saves approximately $1 million annually compared to sending the same request volume to a third-party LLM API service provider.

Disaggregated Serving and Future Plans

The Perplexity team is actively collaborating with the NVIDIA Triton engineering team to deploy disaggregated serving, a technique that separates the prefill and decode phases of an LLM workflow onto separate NVIDIA GPUs. This significantly boosts overall system throughput while meeting SLAs, translating to a lower cost per token. It also gives Perplexity the flexibility to use different NVIDIA GPU products for each inference phase, matched to that phase’s specific hardware resource requirements.
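Because the Triton integration is still being rolled out, the sketch below only illustrates the concept rather than any real API: a prefill worker consumes the prompt and produces the KV cache, which is then handed to a separate decode worker that generates output tokens. All class and method names are hypothetical.

```python
# Hypothetical sketch of disaggregated serving: prefill and decode run
# on different workers (e.g. different GPU types), with the KV cache
# transferred between them. Purely conceptual; not Triton's API.
from dataclasses import dataclass

@dataclass
class KVCache:
    """Stand-in for the attention key/value cache produced by prefill."""
    prompt_tokens: list[str]

class PrefillWorker:
    """Runs the compute-bound prompt-processing (prefill) phase."""
    def prefill(self, prompt: str) -> KVCache:
        tokens = prompt.split()          # placeholder for real tokenization
        return KVCache(prompt_tokens=tokens)

class DecodeWorker:
    """Runs the memory-bandwidth-bound token-generation (decode) phase."""
    def decode(self, cache: KVCache, max_new_tokens: int) -> list[str]:
        # Placeholder generation loop; a real decoder samples from the model.
        return [f"<tok{i}>" for i in range(max_new_tokens)]

def serve(prompt: str) -> list[str]:
    cache = PrefillWorker().prefill(prompt)   # could run on one GPU pool
    return DecodeWorker().decode(cache, 8)    # and decode on another

if __name__ == "__main__":
    print(serve("How does disaggregated serving lower cost per token?"))
```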

Hardware Innovations

The Perplexity team understands that optimizing the software stack can only drive performance improvements to a certain extent. To deliver new levels of performance, hardware innovations are crucial. This is why they are eager to assess the NVIDIA Blackwell platform.

Conclusion:

Perplexity AI’s success in handling over 435 million search queries each month is a testament to the power of NVIDIA’s inference stack. As AI-powered search continues to grow, the partnership between Perplexity and NVIDIA will remain central to delivering high-throughput, low-latency inference. With plans to deploy disaggregated serving and assess the NVIDIA Blackwell platform, Perplexity is poised to take AI-powered search to new heights.

Table: Perplexity AI’s Inference Stack

| Component | Description |
| --- | --- |
| NVIDIA H100 Tensor Core GPUs | High-performance GPUs for AI inference |
| NVIDIA Triton Inference Server | Software framework for deploying and managing AI models |
| NVIDIA TensorRT-LLM | Library for optimizing large language model inference |
| Kubernetes cluster | Container orchestration system for scalable deployment |
| Front-end scheduler | Custom-built scheduler for routing traffic to GPU pods |

Table: Benefits of Perplexity AI’s Inference Stack

| Benefit | Description |
| --- | --- |
| Cost savings | Significant reduction in inference serving cost |
| Improved performance | High-throughput and low-latency inference |
| Scalability | Ability to handle massive query loads efficiently |
| Flexibility | Use of different NVIDIA GPU products for each inference phase |
| Future-proofing | Plans to deploy disaggregated serving and assess the NVIDIA Blackwell platform |