Summary
Dynamic Memory Compression (DMC) is a technique from NVIDIA that improves the inference efficiency of large language models (LLMs). DMC adaptively compresses the key–value (KV) cache, the model's conversation state, reducing memory usage without altering the Transformer architecture. This lets LLMs handle longer sequences and larger batches without exhausting memory, leading to significant improvements in throughput and latency.
What is Dynamic Memory Compression?
Dynamic Memory Compression (DMC) is a method for online compression of the key–value (KV) cache at inference time. It addresses a central memory limitation of LLMs: the KV cache grows linearly with sequence length and batch size, so for long sequences it can come to dominate GPU memory. DMC compresses this cache adaptively during decoding without incurring a drop in downstream performance.
How Does DMC Work?
DMC works by teaching pre-existing LLMs to adaptively compress their conversation state. This is achieved through retrofitting: the model is further trained on a negligible percentage of its original pre-training data while the compression behavior is gradually introduced. The model learns to apply different compression ratios in different heads and layers, allowing it to compress the conversation state without sacrificing performance.
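As a rough illustration of the retrofitting schedule, here is a minimal Python sketch. It assumes the target compression ratio is ramped linearly from 1x to its final value over the retrofitting steps, with a hypothetical auxiliary loss that nudges the model toward the current target; the function names and the loss form are illustrative assumptions, not NVIDIA's exact recipe.

```python
def target_compression_ratio(step: int, total_steps: int, final_cr: float) -> float:
    """Linearly ramp the target compression ratio from 1x to final_cr.

    Illustrative assumption: annealing the target gradually lets the
    model adapt to compression without a sudden quality drop.
    """
    progress = min(step / total_steps, 1.0)
    return 1.0 + (final_cr - 1.0) * progress

def compression_loss(achieved_cr: float, target_cr: float) -> float:
    """Hypothetical auxiliary loss: penalize falling short of the target.

    Would be added to the usual language-modeling loss during retrofitting.
    """
    return max(0.0, target_cr - achieved_cr)

# Example: a quarter of the way through a 4x retrofit, the target is 1.75x.
print(target_compression_ratio(step=250, total_steps=1000, final_cr=4.0))  # 1.75
```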
The compression process is based on a decision variable that determines, for each new token, whether its key–value pair is appended to the cache as a new entry or merged, via a weighted average, with the most recent entry. This decision is made separately for each token, layer, and head, which is what makes the compression adaptive.
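The cache update itself can be sketched in a few lines. The following is a minimal NumPy sketch, not NVIDIA's implementation: it assumes a hard binary decision `alpha` (append vs. merge) and an importance weight `omega` that drives a running weighted average over the most recent cache slot; all variable names are our own.

```python
import numpy as np

def dmc_cache_update(keys, values, weights, k_t, v_t, alpha, omega):
    """Append-or-merge KV cache update for one head at one layer.

    keys, values: lists of cached key/value vectors
    weights:      accumulated importance mass of the last cache slot
    alpha:        binary decision (1 = append a new slot, 0 = merge)
    omega:        importance weight of the incoming token
    """
    if alpha == 1 or not keys:
        # Append: open a fresh slot for the new token.
        keys.append(k_t)
        values.append(v_t)
        weights.append(omega)
    else:
        # Merge: fold the new token into the last slot as a running
        # weighted average, so one slot summarizes several tokens.
        z = weights[-1]
        keys[-1] = (z * keys[-1] + omega * k_t) / (z + omega)
        values[-1] = (z * values[-1] + omega * v_t) / (z + omega)
        weights[-1] = z + omega
    return keys, values, weights

# Toy usage: three tokens, the third merged into the second slot.
d = 4
keys, values, weights = [], [], []
for alpha in (1, 1, 0):
    k_t, v_t = np.random.randn(d), np.random.randn(d)
    keys, values, weights = dmc_cache_update(
        keys, values, weights, k_t, v_t, alpha, omega=1.0)
print(len(keys))  # 2 slots cached for 3 tokens -> 1.5x compression
```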
Benefits of DMC
DMC offers several benefits, including:
- Improved Throughput: by shrinking the conversation state, DMC frees GPU memory for larger inference batches, increasing the throughput of LLMs by up to 700%.
- Reduced Latency: a smaller cache means less memory traffic in the attention layers, reducing the per-token latency of generation.
- Longer Sequences: DMC enables LLMs to handle longer sequences within a fixed memory budget, making them more suitable for applications that require generating long sequences of tokens (the sketch after this list puts numbers on the savings).
- Preserved Performance: DMC preserves the original downstream performance of LLMs, even at high compression rates.
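To put numbers on the memory savings, the back-of-the-envelope calculation below estimates the KV cache footprint of Llama-2-7B (32 layers, 32 attention heads of dimension 128, fp16) at a 4,096-token context, and what 4x and 8x compression leave of it. The model dimensions are public; the rest is arithmetic.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128, bytes_per_el=2):
    """KV cache size: 2 tensors (K and V) per layer, per head, per token."""
    return 2 * n_layers * n_heads * head_dim * bytes_per_el * seq_len

GIB = 1024 ** 3
for cr in (1, 4, 8):
    size = kv_cache_bytes(seq_len=4096) / cr
    print(f"{cr}x compression: {size / GIB:.2f} GiB per sequence")
# 1x: 2.00 GiB, 4x: 0.50 GiB, 8x: 0.25 GiB
```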
Results
The results are impressive: DMC achieves performance comparable to vanilla models across downstream tasks covering factuality (MMLU), common-sense question answering, and coding (HumanEval), and it composes with 8-bit quantization with negligible accuracy loss. The results are summarized in the following tables:
| Model | Compression Rate | MMLU (Factuality) | Common-Sense QA | HumanEval (Coding) |
|---|---|---|---|---|
| Llama-2-7B | 1x | 44.6 | 70.5 | 14.0 |
| Llama-2-7B | 4x | 44.2 | 70.2 | 16.5 |
| Llama-2-7B | 8x | 41.8 | 70.1 | 16.5 |
| Llama-2-13B | 1x | 54.5 | 73.5 | 17.5 |
| Llama-2-13B | 4x | 54.2 | 73.2 | 22.0 |
| Llama-2-13B | 8x | 52.1 | 73.3 | 21.3 |
| Model | Compression Rate | MMLU (Factuality) |
|---|---|---|
| Llama-2-7B + DMC 4x | 4x | 44.2 |
| Llama-2-7B + DMC 4x (8-bit quantization) | 4x | 44.6 |
| Llama-2-7B + DMC 8x | 8x | 41.8 |
| Llama-2-7B + DMC 8x (8-bit quantization) | 8x | 41.7 |
Conclusion
Dynamic Memory Compression (DMC) is a practical way to make large language models (LLMs) more efficient. By adaptively compressing the conversation state, DMC allows LLMs to handle longer sequences and larger batches without exhausting memory, delivering significant improvements in throughput and latency while preserving the original downstream performance even at high compression rates. Whether you're a researcher, developer, or practitioner serving LLMs, DMC is well worth a close look.