Entropy-Guided KV Caching for Efficient LLM Inference
Problem Statement
KV cache memory grows linearly with context length while attention compute grows quadratically, making long-context LLM deployment expensive. Existing KV cache compression methods often apply uniform budgets across layers or select tokens independently per head, ignoring layer-wise attention diversity and breaking the structural coherence of multi-head attention. A principled mechanism for differentiating cache allocation based on each layer's informational role is missing.
Key Novelty
- Entropy-based per-layer cache budget allocation: layers with higher attention entropy (broader context utilization) receive larger KV cache budgets, while low-entropy sink layers receive smaller budgets
- Layer-level token selection via aggregated attention scores across all heads, rather than independent per-head selection, preserving multi-head attention structural integrity
- Use of attention weight entropy as a proxy for contextual importance, providing a principled and interpretable signal for dynamic cache compression during the prefilling phase
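As a concrete illustration of the entropy signal described above, the sketch below computes the Shannon entropy of each head's attention distribution and averages it into a per-layer importance score. The shapes and function names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def attention_entropy(attn):
    """Shannon entropy of each head's attention distribution.

    attn: array of shape (num_heads, query_len, key_len) whose rows
    sum to 1 (softmax output). Returns mean entropy per head.
    """
    eps = 1e-12                                      # guard against log(0)
    h = -(attn * np.log(attn + eps)).sum(axis=-1)    # (heads, queries)
    return h.mean(axis=-1)                           # average over query positions

def layer_importance(attn):
    """Average the per-head entropies into one scalar per layer."""
    return attention_entropy(attn).mean()
```

A head attending uniformly over K keys scores log K (broad context use), while a head locked onto a single sink token scores near zero, which is exactly the contrast the budget allocator exploits.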
Evaluation Highlights
- On Qwen3 4B, memory usage is reduced by 4.18% while the ROUGE score is preserved, demonstrating effectively lossless compression
- On Mistral 7B v0.1, decoding time is reduced by 46.6%, demonstrating a significant inference speedup from entropy-guided cache budgeting
Methodology
- During the prefilling phase, compute attention weights for each head in every Transformer layer and calculate per-head entropy of the attention distribution
- Average the entropy values across all heads within a layer to obtain a layer-level contextual importance score; assign larger KV cache budgets to high-entropy layers and smaller budgets to low-entropy (sink) layers
- For each layer, aggregate attention scores across all heads to select a single common set of the most important tokens, caching those key-value pairs uniformly across all heads within that layer for use during decoding
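The budget-allocation step above can be sketched as follows. The proportional split on normalized entropies and the `min_budget` floor are illustrative assumptions, since the exact mapping from entropy to budget is not specified here:

```python
import numpy as np

def allocate_budgets(layer_entropies, total_budget, min_budget=8):
    """Split a total KV-cache token budget across layers in proportion
    to their mean attention entropy, with a per-layer floor."""
    e = np.asarray(layer_entropies, dtype=float)
    n = len(e)
    spare = total_budget - n * min_budget       # tokens left after the floor
    weights = e / e.sum()                       # normalized entropy shares
    budgets = min_budget + np.floor(spare * weights).astype(int)
    # hand tokens lost to flooring to the highest-entropy layers
    leftover = total_budget - budgets.sum()
    for i in np.argsort(-e)[:leftover]:
        budgets[i] += 1
    return budgets
```

The floor keeps low-entropy sink layers from being starved entirely, while the proportional share sends most of the spare capacity to layers with broad attention dispersion.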
System Components
- Per-head entropy computation: computes the Shannon entropy of each head's attention weight distribution, quantifying how broadly or narrowly the head attends to context tokens
- Layer importance scoring: averages the per-head entropies within a layer into a single scalar representing the layer's contextual diversity and importance
- Cache budget allocation: maps layer importance scores to KV cache size budgets, assigning more capacity to high-entropy layers and less to low-entropy sink layers
- Shared token selection: combines attention scores across all heads within a layer to identify a shared set of important tokens, yielding a single unified KV cache per layer rather than per-head caches
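The shared token selection component might look like the minimal sketch below. Aggregating by summing the attention mass each key receives over all heads and query positions is an assumption consistent with the description above, not a confirmed detail:

```python
import numpy as np

def select_shared_tokens(attn, budget):
    """Pick one common set of important key positions per layer.

    attn: (num_heads, query_len, key_len) softmax weights.
    Each key is scored by the total attention mass it receives,
    summed over heads and queries; the top-`budget` keys are kept
    for every head in the layer.
    """
    scores = attn.sum(axis=(0, 1))       # (key_len,) aggregated importance
    keep = np.argsort(-scores)[:budget]  # indices of the top-scoring keys
    return np.sort(keep)                 # cache indices in positional order
```

Because every head in the layer caches the same positions, the compressed cache stays a dense rectangular tensor, avoiding the ragged per-head index bookkeeping that independent selection would require.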
Results
| Metric/Benchmark | Baseline | This Paper | Delta |
|---|---|---|---|
| Memory Usage (Qwen3 4B) | Full KV cache (100%) | ~95.82% of baseline | -4.18% |
| Decoding Time (Mistral 7B v0.1) | Full KV cache baseline | ~53.4% of baseline | -46.6% |
| ROUGE Score (Qwen3 4B) | Baseline ROUGE | Preserved (no degradation) | ~0% |
| Generation Quality (general) | Full cache quality | Maintained | Negligible drop |
Key Takeaways
- Attention entropy is a cheap-to-compute, interpretable signal that can guide non-uniform KV cache allocation across layers, offering a drop-in optimization for long-context inference without retraining
- Selecting a shared token set per layer (rather than per head) is a practical design choice that reduces implementation complexity and preserves multi-head attention coherence while still achieving meaningful compression
- The approach is model-agnostic and shows meaningful gains on both a mid-size model (Qwen3 4B) and a popular open-source model (Mistral 7B), suggesting broad applicability across modern LLM architectures
Abstract
Large language models (LLMs), built upon Transformer architectures, have demonstrated remarkable performance in a wide range of natural language processing tasks. However, their practical deployment—especially in long-context scenarios—is often hindered by the computational and memory costs associated with managing the key–value (KV) cache during inference. Optimizing this process is therefore crucial for improving LLM efficiency and scalability. In this study, we propose a novel entropy-guided KV caching strategy that leverages the distribution characteristics of attention scores within each Transformer layer. Specifically, we compute the entropy of attention weights for each head and use the average entropy of all heads within a layer to assess the layer’s contextual importance. Higher-entropy layers—those exhibiting broader attention dispersion—are allocated larger KV cache budgets, while lower-entropy (sink-like) layers are assigned smaller budgets. Instead of selecting different key–value tokens per head, our method selects a common set of important tokens per layer, based on aggregated attention scores, and caches them uniformly across all heads within the same layer. This design preserves the structural integrity of multi-head attention while enabling efficient token selection during the prefilling phase. The experimental results demonstrate that our approach improves cache utilization and inference speed without compromising generation quality. For example, on the Qwen3 4B model, our method reduces memory usage by 4.18% while preserving the ROUGE score, and on Mistral 7B v0.1, it reduces decoding time by 46.6%, highlighting entropy-guided layer analysis as a principled mechanism for scalable long-context language modeling.