Summary

Dynamic Memory Compression (DMC) is a technique from NVIDIA research that improves the inference efficiency of large language models (LLMs). DMC adaptively compresses the key–value (KV) cache, the model's conversation state, reducing memory usage without altering the Transformer architecture. This lets LLMs handle longer sequences and larger batches without exhausting GPU memory, yielding significant improvements in throughput and latency.

What is Dynamic Memory Compression?

Dynamic Memory Compression (DMC) is a method for online key–value (KV) cache compression at inference time. It addresses a central memory bottleneck of LLM inference: the KV cache grows linearly with sequence length and batch size, and for long sequences it can dominate GPU memory. DMC teaches the model to compress this cache adaptively during generation, without incurring a drop in downstream performance.
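To build intuition for why this matters, consider a back-of-the-envelope estimate of KV cache size. The sketch below uses Llama-2-7B-like dimensions (32 layers, 32 heads, head dimension 128, 16-bit values); kv_cache_bytes is an illustrative helper, not part of any DMC implementation, and the numbers are for intuition only.

    # Rough KV-cache size estimate (illustrative, not measured).
    n_layers, n_heads, head_dim = 32, 32, 128   # Llama-2-7B-like config
    bytes_per_elem = 2                          # fp16/bf16

    def kv_cache_bytes(seq_len, batch_size=1, compression=1):
        # Factor of 2 covers keys and values; DMC divides the number of
        # stored entries by its compression rate (e.g. 4x or 8x).
        return (2 * n_layers * n_heads * head_dim * bytes_per_elem
                * seq_len * batch_size) // compression

    gb = 1024 ** 3
    print(kv_cache_bytes(8192) / gb)                  # ~4 GB uncompressed
    print(kv_cache_bytes(8192, compression=4) / gb)   # ~1 GB with DMC 4x

At roughly 0.5 MB of cache per token under these assumptions, a single 8K-token sequence already consumes about 4 GB, which is why compressing the cache translates directly into headroom for longer contexts or larger batches.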

How Does DMC Work?

DMC works by retrofitting pre-trained LLMs so that they learn to compress their own KV cache. Retrofitting consists of continued training on a negligible percentage of the original training data. During this phase, the model learns to apply different compression ratios in different heads and layers, compressing the cache where it can afford to without sacrificing performance.
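One way to picture retrofitting is as ordinary language-model training plus an auxiliary objective that gradually pushes the model toward a target compression ratio. The sketch below is a simplified, hypothetical rendering of that idea (the exact relaxation and loss in the paper differ in detail): a continuous merge score in [0, 1] is penalized whenever the fraction of merged tokens falls below the level implied by the current target ratio, and the target is ramped up over training.

    import torch

    def dmc_aux_loss(alpha, target_cr, lam=1.0):
        """One-sided penalty pushing the fraction of merged tokens toward
        the level implied by a target compression ratio (illustrative)."""
        # A compression ratio r keeps roughly 1/r of the entries, so about
        # (r - 1) / r of incoming tokens should be merged rather than appended.
        target_merge_frac = (target_cr - 1.0) / target_cr
        merge_frac = alpha.mean()   # alpha in [0, 1], one score per decision
        return lam * torch.relu(target_merge_frac - merge_frac)

    def target_cr_schedule(step, total_steps, final_cr=4.0):
        # Ramp the target from 1x (no compression) to the final ratio,
        # so the model adapts gradually during retrofitting.
        return 1.0 + (final_cr - 1.0) * min(step / total_steps, 1.0)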

At inference time, compression is driven by a binary decision variable that determines, for each incoming token, whether to append its key–value pair to the cache as a new entry or to merge it, via a weighted average, with the cache's most recent entry. This decision is made separately for each token, layer, and head, which is what makes the compression adaptive.
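In code, the per-head cache update might look like the following sketch. This is a simplified rendering of the mechanism, not NVIDIA's implementation: the decision alpha and an importance weight omega are taken as inputs here, whereas in the paper they are predicted by the retrofitted model itself from its own representations.

    import torch

    class DMCHeadCache:
        """Per-head KV cache with DMC-style append-or-merge updates
        (illustrative sketch)."""

        def __init__(self):
            self.keys, self.values = [], []   # cache entries
            self.z = 0.0                      # accumulated weight of last entry

        def update(self, k, v, alpha, omega):
            # alpha in {0, 1}: 1 = merge into the last entry, 0 = append.
            # omega > 0 is the importance weight of the incoming token.
            if alpha == 1 and self.keys:
                # Weighted running average with the most recent entry.
                total = self.z + omega
                self.keys[-1] = (self.z * self.keys[-1] + omega * k) / total
                self.values[-1] = (self.z * self.values[-1] + omega * v) / total
                self.z = total
            else:
                self.keys.append(k)
                self.values.append(v)
                self.z = omega

        def tensors(self):
            # Compressed keys/values that attention actually reads.
            return torch.stack(self.keys), torch.stack(self.values)

Because the decision is made independently per layer and head, a model would hold one such cache per (layer, head) pair, and heads that merge aggressively end up with far fewer entries than heads that mostly append.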

Benefits of DMC

DMC offers several benefits, including:

  • Improved Throughput: DMC can increase the throughput of LLMs by up to 700%, because the compressed cache frees GPU memory that can be spent on larger batches.
  • Reduced Latency: Attention reads from a shorter, compressed cache, so generating each token requires less memory traffic.
  • Longer Sequences: DMC enables LLMs to handle longer sequences without exhausting memory, making them better suited to applications that generate or ingest long contexts.
  • Preserved Performance: DMC preserves the original downstream performance of the model, even at high compression rates.

Results

DMC-retrofitted models achieve performance comparable to the vanilla models on a range of downstream tasks, including factuality (MMLU), common-sense question answering, and coding (HumanEval). The results are summarized in the following tables:

Model         Compression Rate   MMLU (Factuality)   Common-Sense QA   HumanEval (Coding)
Llama-2-7B    1x                 44.6                70.5              14.0
Llama-2-7B    4x                 44.2                70.2              16.5
Llama-2-7B    8x                 41.8                70.1              16.5
Llama-2-13B   1x                 54.5                73.5              17.5
Llama-2-13B   4x                 54.2                73.2              22.0
Llama-2-13B   8x                 52.1                73.3              21.3
DMC also composes with 8-bit quantization of the KV cache, with little additional loss in accuracy:

Model                                      Compression Rate   MMLU (Factuality)
Llama-2-7B + DMC 4x                        4x                 44.2
Llama-2-7B + DMC 4x (8-bit quantization)   4x                 44.6
Llama-2-7B + DMC 8x                        8x                 41.8
Llama-2-7B + DMC 8x (8-bit quantization)   8x                 41.7

Conclusion

Dynamic Memory Compression (DMC) substantially improves the inference efficiency of large language models. By adaptively compressing the KV cache, DMC lets LLMs handle longer sequences and larger batches without exhausting GPU memory, with significant gains in throughput and latency. Because it preserves the original downstream performance even at high compression rates and requires no change to the Transformer architecture, DMC is a compelling option for researchers, developers, and practitioners deploying LLMs.