Entropy-Guided KV Caching for Efficient LLM Inference
Problem Statement
KV cache memory grows linearly with context length while attention compute grows quadratically, making long-context LLM deployment expensive. Existing KV cache compression methods often apply uniform budgets across layers or select tokens independently per head, ignoring layer-wise attention diversity and breaking the structural coherence of multi-head attention. A principled mechanism for differentiating cache allocation based on each layer's informational role is missing.
Key Novelty
- Entropy-based per-layer cache budget allocation: layers with higher attention entropy (broader context utilization) receive larger KV cache budgets, while low-entropy sink layers receive smaller budgets
- Layer-level token selection via aggregated attention scores across all heads, rather than independent per-head selection, preserving multi-head attention structural integrity
- Use of attention weight entropy as a proxy for contextual importance, providing a principled and interpretable signal for dynamic cache compression during the prefilling phase
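As a concrete illustration of the entropy signal described above, the sketch below computes the Shannon entropy of each head's attention distribution and averages it into a per-layer importance score. The shapes and function names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def attention_entropy(attn):
    """Shannon entropy of each head's attention distribution.

    attn: array of shape (num_heads, query_len, key_len) whose rows
    sum to 1 (softmax output). Returns mean entropy per head.
    """
    eps = 1e-12                                      # guard against log(0)
    h = -(attn * np.log(attn + eps)).sum(axis=-1)    # (heads, queries)
    return h.mean(axis=-1)                           # average over query positions

def layer_importance(attn):
    """Average the per-head entropies into one scalar per layer."""
    return attention_entropy(attn).mean()
```

A head attending uniformly over K keys scores log K (broad context use), while a head locked onto a single sink token scores near zero, which is exactly the contrast the budget allocator exploits.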
Evaluation Highlights
- On Qwen3 4B, memory usage is reduced by 4.18% while the ROUGE score is preserved, demonstrating effectively lossless compression
- On Mistral 7B v0.1, decoding time is reduced by 46.6%, demonstrating a significant inference speedup from entropy-guided cache budgeting
Methodology
- During the prefilling phase, compute attention weights for each head in every Transformer layer and calculate per-head entropy of the attention distribution
- Average the entropy values across all heads within a layer to obtain a layer-level contextual importance score; assign larger KV cache budgets to high-entropy layers and smaller budgets to low-entropy (sink) layers
- For each layer, aggregate attention scores across all heads to select a single common set of the most important tokens, caching those key-value pairs uniformly across all heads within that layer for use during decoding
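The budget-allocation step above can be sketched as follows. The proportional split on normalized entropies and the `min_budget` floor are illustrative assumptions, since the exact mapping from entropy to budget is not specified here:

```python
import numpy as np

def allocate_budgets(layer_entropies, total_budget, min_budget=8):
    """Split a total KV-cache token budget across layers in proportion
    to their mean attention entropy, with a per-layer floor."""
    e = np.asarray(layer_entropies, dtype=float)
    n = len(e)
    spare = total_budget - n * min_budget       # tokens left after the floor
    weights = e / e.sum()                       # normalized entropy shares
    budgets = min_budget + np.floor(spare * weights).astype(int)
    # hand tokens lost to flooring to the highest-entropy layers
    leftover = total_budget - budgets.sum()
    for i in np.argsort(-e)[:leftover]:
        budgets[i] += 1
    return budgets
```

The floor keeps low-entropy sink layers from being starved entirely, while the proportional share sends most of the spare capacity to layers with broad attention dispersion.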
System Components
- Per-head entropy computation: computes the Shannon entropy of each head's attention weight distribution, quantifying how broadly or narrowly the head attends to context tokens
- Layer importance scoring: averages the per-head entropies within a layer into a single scalar representing the layer's contextual diversity and importance
- Cache budget allocation: maps layer importance scores to KV cache size budgets, assigning more capacity to high-entropy layers and less to low-entropy sink layers
- Shared token selection: combines attention scores across all heads within a layer to identify a shared set of important tokens, yielding a single unified KV cache per layer rather than per-head caches
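The shared token selection component might look like the minimal sketch below. Aggregating by summing the attention mass each key receives over all heads and query positions is an assumption consistent with the description above, not a confirmed detail:

```python
import numpy as np

def select_shared_tokens(attn, budget):
    """Pick one common set of important key positions per layer.

    attn: (num_heads, query_len, key_len) softmax weights.
    Each key is scored by the total attention mass it receives,
    summed over heads and queries; the top-`budget` keys are kept
    for every head in the layer.
    """
    scores = attn.sum(axis=(0, 1))       # (key_len,) aggregated importance
    keep = np.argsort(-scores)[:budget]  # indices of the top-scoring keys
    return np.sort(keep)                 # cache indices in positional order
```

Because every head in the layer caches the same positions, the compressed cache stays a dense rectangular tensor, avoiding the ragged per-head index bookkeeping that independent selection would require.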
Results
| Metric/Benchmark | Baseline | This Paper | Delta |
|---|---|---|---|
| Memory Usage (Qwen3 4B) | Full KV cache (100%) | ~95.82% of baseline | -4.18% |
| Decoding Time (Mistral 7B v0.1) | Full KV cache baseline | ~53.4% of baseline | -46.6% |
| ROUGE Score (Qwen3 4B) | Baseline ROUGE | Preserved (no degradation) | ~0% |
| Generation Quality (general) | Full cache quality | Maintained | Negligible drop |
Key Takeaways
- Attention entropy is a cheap-to-compute, interpretable signal that can guide non-uniform KV cache allocation across layers, offering a drop-in optimization for long-context inference without retraining
- Selecting a shared token set per layer (rather than per head) is a practical design choice that reduces implementation complexity and preserves multi-head attention coherence while still achieving meaningful compression
- The approach is model-agnostic and shows meaningful gains on both a mid-size model (Qwen3 4B) and a popular open-source model (Mistral 7B), suggesting broad applicability across modern LLM architectures
Abstract
Large language models (LLMs), built upon Transformer architectures, have demonstrated remarkable performance in a wide range of natural language processing tasks. However, their practical deployment—especially in long-context scenarios—is often hindered by the computational and memory costs associated with managing the key–value (KV) cache during inference. Optimizing this process is therefore crucial for improving LLM efficiency and scalability. In this study, we propose a novel entropy-guided KV caching strategy that leverages the distribution characteristics of attention scores within each Transformer layer. Specifically, we compute the entropy of attention weights for each head and use the average entropy of all heads within a layer to assess the layer’s contextual importance. Higher-entropy layers—those exhibiting broader attention dispersion—are allocated larger KV cache budgets, while lower-entropy (sink-like) layers are assigned smaller budgets. Instead of selecting different key–value tokens per head, our method selects a common set of important tokens per layer, based on aggregated attention scores, and caches them uniformly across all heads within the same layer. This design preserves the structural integrity of multi-head attention while enabling efficient token selection during the prefilling phase. The experimental results demonstrate that our approach improves cache utilization and inference speed without compromising generation quality. For example, on the Qwen3 4B model, our method reduces memory usage by 4.18% while preserving the ROUGE score, and on Mistral 7B v0.1, it reduces decoding time by 46.6%, highlighting entropy-guided layer analysis as a principled mechanism for scalable long-context language modeling.