Summary
Dynamic Memory Compression (DMC) is a technique from NVIDIA that improves the inference efficiency of large language models (LLMs). DMC adaptively compresses the key–value (KV) cache, the model's conversation state, reducing memory usage without altering the Transformer architecture. This lets LLMs handle longer sequences and larger batches without exhausting memory, leading to significant improvements in throughput and latency.
What is Dynamic Memory Compression?
Dynamic Memory Compression (DMC) is a method for online compression of the key–value (KV) cache at inference time. It addresses a central memory limitation of LLMs: the KV cache grows linearly with sequence length and batch size, so for long sequences it can come to dominate GPU memory. DMC compresses this cache adaptively during decoding without incurring a drop in downstream performance.
How Does DMC Work?
DMC works by teaching pre-existing LLMs to adaptively compress their conversation state. This is achieved through retrofitting: the model is further trained on a negligible percentage of its original pre-training data while the compression behavior is gradually introduced. The model learns to apply different compression ratios in different heads and layers, allowing it to compress the conversation state without sacrificing performance.
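As a rough illustration of the retrofitting schedule, here is a minimal Python sketch. It assumes the target compression ratio is ramped linearly from 1x to its final value over the retrofitting steps, with a hypothetical auxiliary loss that nudges the model toward the current target; the function names and the loss form are illustrative assumptions, not NVIDIA's exact recipe.

```python
def target_compression_ratio(step: int, total_steps: int, final_cr: float) -> float:
    """Linearly ramp the target compression ratio from 1x to final_cr.

    Illustrative assumption: annealing the target gradually lets the
    model adapt to compression without a sudden quality drop.
    """
    progress = min(step / total_steps, 1.0)
    return 1.0 + (final_cr - 1.0) * progress

def compression_loss(achieved_cr: float, target_cr: float) -> float:
    """Hypothetical auxiliary loss: penalize falling short of the target.

    Would be added to the usual language-modeling loss during retrofitting.
    """
    return max(0.0, target_cr - achieved_cr)

# Example: a quarter of the way through a 4x retrofit, the target is 1.75x.
print(target_compression_ratio(step=250, total_steps=1000, final_cr=4.0))  # 1.75
```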
The compression process is based on a decision variable that determines, for each new token, whether its key–value pair is appended to the cache as a new entry or merged, via a weighted average, with the most recent entry. This decision is made separately for each token, layer, and head, which is what makes the compression adaptive.
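The cache update itself can be sketched in a few lines. The following is a minimal NumPy sketch, not NVIDIA's implementation: it assumes a hard binary decision `alpha` (append vs. merge) and an importance weight `omega` that drives a running weighted average over the most recent cache slot; all variable names are our own.

```python
import numpy as np

def dmc_cache_update(keys, values, weights, k_t, v_t, alpha, omega):
    """Append-or-merge KV cache update for one head at one layer.

    keys, values: lists of cached key/value vectors
    weights:      accumulated importance mass of the last cache slot
    alpha:        binary decision (1 = append a new slot, 0 = merge)
    omega:        importance weight of the incoming token
    """
    if alpha == 1 or not keys:
        # Append: open a fresh slot for the new token.
        keys.append(k_t)
        values.append(v_t)
        weights.append(omega)
    else:
        # Merge: fold the new token into the last slot as a running
        # weighted average, so one slot summarizes several tokens.
        z = weights[-1]
        keys[-1] = (z * keys[-1] + omega * k_t) / (z + omega)
        values[-1] = (z * values[-1] + omega * v_t) / (z + omega)
        weights[-1] = z + omega
    return keys, values, weights

# Toy usage: three tokens, the third merged into the second slot.
d = 4
keys, values, weights = [], [], []
for alpha in (1, 1, 0):
    k_t, v_t = np.random.randn(d), np.random.randn(d)
    keys, values, weights = dmc_cache_update(
        keys, values, weights, k_t, v_t, alpha, omega=1.0)
print(len(keys))  # 2 slots cached for 3 tokens -> 1.5x compression
```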
Benefits of DMC
DMC offers several benefits, including:
- Improved Throughput: by shrinking the conversation state, DMC frees GPU memory for larger inference batches, increasing the throughput of LLMs by up to 700%.
- Reduced Latency: a smaller cache means less memory traffic in the attention layers, reducing the per-token latency of generation.
- Longer Sequences: DMC enables LLMs to handle longer sequences within a fixed memory budget, making them more suitable for applications that require generating long sequences of tokens (the sketch after this list puts numbers on the savings).
- Preserved Performance: DMC preserves the original downstream performance of LLMs, even at high compression rates.
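To put numbers on the memory savings, the back-of-the-envelope calculation below estimates the KV cache footprint of Llama-2-7B (32 layers, 32 attention heads of dimension 128, fp16) at a 4,096-token context, and what 4x and 8x compression leave of it. The model dimensions are public; the rest is arithmetic.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128, bytes_per_el=2):
    """KV cache size: 2 tensors (K and V) per layer, per head, per token."""
    return 2 * n_layers * n_heads * head_dim * bytes_per_el * seq_len

GIB = 1024 ** 3
for cr in (1, 4, 8):
    size = kv_cache_bytes(seq_len=4096) / cr
    print(f"{cr}x compression: {size / GIB:.2f} GiB per sequence")
# 1x: 2.00 GiB, 4x: 0.50 GiB, 8x: 0.25 GiB
```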
Results
The results are impressive: DMC achieves performance comparable to vanilla models across downstream tasks covering factuality (MMLU), common-sense question answering, and coding (HumanEval), and it composes with 8-bit quantization with negligible accuracy loss. The results are summarized in the following tables:
| Model | Compression Rate | MMLU (Factuality) | Common-Sense QA | HumanEval (Coding) |
|---|---|---|---|---|
| Llama-2-7B | 1x | 44.6 | 70.5 | 14.0 |
| Llama-2-7B | 4x | 44.2 | 70.2 | 16.5 |
| Llama-2-7B | 8x | 41.8 | 70.1 | 16.5 |
| Llama-2-13B | 1x | 54.5 | 73.5 | 17.5 |
| Llama-2-13B | 4x | 54.2 | 73.2 | 22.0 |
| Llama-2-13B | 8x | 52.1 | 73.3 | 21.3 |
| Model | Compression Rate | MMLU (Factuality) |
|---|---|---|
| Llama-2-7B + DMC 4x | 4x | 44.2 |
| Llama-2-7B + DMC 4x (8-bit quantization) | 4x | 44.6 |
| Llama-2-7B + DMC 8x | 8x | 41.8 |
| Llama-2-7B + DMC 8x (8-bit quantization) | 8x | 41.7 |
Conclusion
Dynamic Memory Compression (DMC) is a practical way to make large language models (LLMs) more efficient. By adaptively compressing the conversation state, DMC allows LLMs to handle longer sequences and larger batches without exhausting memory, delivering significant improvements in throughput and latency while preserving the original downstream performance even at high compression rates. Whether you're a researcher, developer, or practitioner serving LLMs, DMC is well worth a close look.