Building Next-Generation Enterprise Apps with Retrieval-Augmented Generation

Summary

Retrieval-Augmented Generation (RAG) is a powerful technique that enhances the capabilities of large language models (LLMs) by integrating them with information retrieval systems. This approach allows enterprises to build applications that provide accurate, up-to-date responses to user queries by leveraging external knowledge sources. The NVIDIA Retrieval QA Embedding Model is a key component in building such RAG applications, offering state-of-the-art performance and customization options for various industries.

Introduction

Large language models have revolutionized the AI landscape with their deep understanding of human and programming languages. However, because their knowledge is fixed at training time, these models often struggle with recent events and specialized knowledge domains. This is where Retrieval-Augmented Generation comes into play, addressing these limitations by grounding LLMs in external knowledge sources.

What is Retrieval-Augmented Generation?

RAG is a technique that grounds an AI model's output in retrieved evidence, guiding it to generate relevant and coherent responses to a prompt. It first appeared in research published in 2020 and has since gained significant attention for its potential to improve the quality and usability of LLM technologies in the enterprise space.

How RAG Works

RAG works by using an information retrieval system to identify and retrieve the most relevant data from external sources based on a user’s query. This retrieved information is then used to augment the prompt, providing the LLM with greater context and ensuring more accurate responses.
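
To make that flow concrete, here is a minimal, self-contained sketch of the retrieve-then-augment loop. The embed function is a toy bag-of-words stand-in for a real embedding model, and the document set is invented for illustration:

```python
# Minimal RAG sketch: retrieve the most relevant documents for a query,
# then augment the LLM prompt with the retrieved context.
import math
from collections import Counter

DOCUMENTS = [
    "Employees accrue 1.5 vacation days per month of service.",
    "The return policy allows refunds within 30 days of purchase.",
    "GPU clusters are reserved through the internal scheduling portal.",
]

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would call an
    # embedding model such as NVIDIA Retrieval QA instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 2) -> list[str]:
    # Rank all documents by similarity to the query and keep the top k.
    q = embed(query)
    ranked = sorted(DOCUMENTS, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str) -> str:
    # Augment the prompt with retrieved context before calling the LLM.
    context = "\n".join(f"- {d}" for d in retrieve(query))
    return f"Answer using only the context below.\nContext:\n{context}\nQuestion: {query}"

print(build_prompt("How many vacation days do employees get?"))
```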

Benefits of RAG

  • Greater Control: Enterprises have more control over the data used by LLMs, ensuring that responses are based on approved company procedures, policies, or product information.
  • Improved Accuracy: RAG helps reduce instances of hallucinations by providing factual and up-to-date information.
  • Enhanced Context: The technique adds contextual information to the prompt, enabling LLMs to generate responses that are accurate and relevant.

The Role of NVIDIA Retrieval QA Embedding Model

The NVIDIA Retrieval QA Embedding Model is a crucial component in building RAG applications. It is an embedding model optimized for text question-answering retrieval, transforming textual information into dense vector representations that can be efficiently compared and searched.
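
As an illustration of what "efficiently compared" means in practice, the sketch below scores a query embedding against a batch of passage embeddings with a single matrix product. The 1024-dimensional vectors are random placeholders standing in for real model output:

```python
# Dense-vector comparison sketch: embeddings are compared with cosine
# similarity, which reduces to a dot product once vectors are normalized.
import numpy as np

rng = np.random.default_rng(0)
passages = rng.standard_normal((5, 1024))   # 5 placeholder passage embeddings
query = rng.standard_normal(1024)           # 1 placeholder query embedding

# L2-normalize so cosine similarity equals the dot product.
passages /= np.linalg.norm(passages, axis=1, keepdims=True)
query /= np.linalg.norm(query)

scores = passages @ query                   # similarity of query to each passage
best = int(np.argmax(scores))
print(f"Best-matching passage index: {best}, score: {scores[best]:.3f}")
```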

Key Features

  • Architecture: The model is a fine-tuned version of E5-Large-Unsupervised, with 24 layers and an embedding size of 1024.
  • Training: The model is trained on a mix of public and private datasets and supports a maximum input length of 512 tokens.
  • Performance: The model has been evaluated on various academic benchmarks, including NQ, HotpotQA, FiQA, and TechQA, showing superior performance compared to other retrieval models.

Evaluation Results

| Retrieval Model | Average Recall@5 (NQ, HotpotQA, FiQA, TechQA) |
|---|---|
| NVIDIA Retrieval QA | 57.37 |
| E5-Large_unsupervised | 45.58 |
| BM25 | 39.97 |
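
Recall@5 here follows the common per-query reading: the fraction of queries for which at least one relevant document appears among the top five retrieved results. A minimal sketch of that metric, using invented document IDs:

```python
# Recall@k sketch: the fraction of queries whose relevant document
# appears among the top-k retrieved results.
def recall_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int = 5) -> float:
    hits = sum(
        1 for ranked, gold in zip(retrieved, relevant)
        if any(doc in gold for doc in ranked[:k])
    )
    return hits / len(retrieved)

# Two queries: the first has a relevant doc in its top 5, the second does not.
retrieved = [["d3", "d7", "d1", "d9", "d2"], ["d4", "d5", "d6", "d8", "d0"]]
relevant = [{"d1"}, {"d2"}]
print(recall_at_k(retrieved, relevant, k=5))  # 0.5
```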

Applications of RAG

RAG has the potential to enhance various enterprise applications, including search, chatbots, and content generation.

Search

By combining search with a retrieval-based LLM, enterprises can provide high-quality, up-to-date responses with citations, significantly reducing instances of hallucinations.
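
Here is a minimal sketch of how citations can be threaded through such a pipeline. The retrieve and llm parameters are hypothetical stand-ins for a real retriever and LLM client:

```python
# Citation sketch: attach source identifiers to retrieved passages so the
# generated answer can cite them.
def answer_with_citations(query: str, retrieve, llm) -> str:
    passages = retrieve(query)  # -> list of (source_id, text) pairs
    context = "\n".join(f"[{sid}] {text}" for sid, text in passages)
    prompt = (
        "Answer the question using the context, and cite sources "
        f"by their [id].\nContext:\n{context}\nQuestion: {query}"
    )
    return llm(prompt)

# Example with trivial stand-ins for the retriever and the LLM:
fake_retrieve = lambda q: [("policy-12", "Refunds are issued within 30 days.")]
fake_llm = lambda p: "Refunds are available within 30 days [policy-12]."
print(answer_with_citations("What is the refund window?", fake_retrieve, fake_llm))
```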

Chatbots

Incorporating RAG with chatbots can lead to richer, context-aware conversations that engage customers and employees while satisfying their queries.

Content Generation

RAG can help businesses create accurate, audience-relevant content in areas such as marketing and human resources.

Building RAG Applications with NVIDIA

To build RAG applications, developers can use the NVIDIA Retrieval QA Embedding Model as part of NVIDIA NeMo Retriever, which provides state-of-the-art, commercially ready models and microservices optimized for low latency and high throughput.

Steps to Build RAG Apps

  1. Select the Model: Choose the appropriate pre-trained model, such as the NVIDIA Retrieval QA Embedding Model.
  2. Customize the Model: Customize the model for domain-specific use cases, such as IT, HR, or R&D research assistants.
  3. Integrate with Information Retrieval System: Integrate the model with an information retrieval system to retrieve relevant data from external sources.
  4. Deploy the Application: Deploy the RAG application using the NeMo Retriever Embedding Microservice, as sketched below.
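
The sketch below shows what a client call to an embedding microservice might look like. It assumes an OpenAI-style /v1/embeddings endpoint; the URL, model name, and input_type field are placeholders, so consult the NeMo Retriever Embedding Microservice documentation for the actual API:

```python
# Hypothetical client sketch for an embedding microservice. The endpoint
# URL, model name, and "input_type" field are placeholders, not the
# documented API.
import requests

EMBEDDING_URL = "http://localhost:8080/v1/embeddings"  # placeholder endpoint

def embed_texts(texts: list[str], input_type: str = "passage") -> list[list[float]]:
    resp = requests.post(
        EMBEDDING_URL,
        json={"model": "nvidia-retrieval-qa", "input": texts, "input_type": input_type},
        timeout=30,
    )
    resp.raise_for_status()
    # One float array per input string, mirroring the output format below.
    return [item["embedding"] for item in resp.json()["data"]]

query_vec = embed_texts(["How do I reset my password?"], input_type="query")[0]
print(len(query_vec))  # embedding dimension, e.g. 1024
```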

Table: Comparison of Retrieval Models

| Retrieval Model | Average Recall@5 on Internal Customer Datasets |
|---|---|
| NVIDIA Retrieval QA | 74.4 |
| DRAGON* | 72.7 |
| E5-Large* | 71.7 |
| BGE* | 71.1 |
| GTR* | 71.0 |
| Contriever* | 69.0 |
| GTE* | 63.9 |
| E5-Large_unsupervised | 61.6 |
| BM25 | 55.6 |

Table: Key Features of NVIDIA Retrieval QA Embedding Model

| Feature | Description |
|---|---|
| Architecture Type | Transformer |
| Network Architecture | Fine-tuned E5-Large-Unsupervised retriever |
| Embedding Dimension | 1024 |
| Parameter Count | 335 million |
| Input Type | Text |
| Input Format | List of strings |
| Output Type | Floats |
| Output Format | List of float arrays, each array containing the embeddings for the corresponding input string |

Conclusion

Retrieval-Augmented Generation is a powerful technique that enhances the capabilities of large language models by integrating them with information retrieval systems. The NVIDIA Retrieval QA Embedding Model is a key component in building such RAG applications, offering state-of-the-art performance and customization options for various industries. By leveraging RAG and the NVIDIA Retrieval QA Embedding Model, enterprises can build next-generation applications that provide accurate, up-to-date responses to user queries.