Adaptive Soft Rolling KV Freeze with Entropy-Guided Recovery: Sublinear Memory Growth for Efficient LLM Inference
Problem Statement
Long-context LLM inference is bottlenecked by the key-value (KV) cache, whose memory footprint grows linearly with sequence length and eventually exhausts GPU memory. Existing eviction-based approaches (e.g., StreamingLLM, H2O) permanently discard tokens, risking irreversible information loss. A reversible, memory-efficient alternative that preserves full context fidelity is needed for practical deployment.
Key Novelty
- Reversible soft-freeze mechanism that suspends KV updates for low-importance tokens while preserving them in off-GPU storage for on-demand restoration, unlike permanent eviction methods
- Entropy-guided importance detection within a sliding attention window to identify which tokens can be safely frozen without quality degradation
- Sublinear freeze scheduling where freeze duration grows sublinearly with repeated low-importance detections, preventing over-aggressive compression and balancing memory savings with fidelity
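The entropy-guided detection above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the threshold value, and the exact form of the per-token attention distribution are assumptions for the example.

```python
import math

def attention_entropy(weights):
    """Shannon entropy of one token's attention distribution within the
    sliding window (weights assumed normalized to sum to 1)."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)

def freeze_candidates(attn_rows, threshold):
    """Flag window positions whose attention entropy falls below
    `threshold` as candidates for soft-freezing.
    `attn_rows[i]` is the attention distribution for window token i."""
    return [i for i, row in enumerate(attn_rows)
            if attention_entropy(row) < threshold]

# A sharply peaked (low-entropy) distribution marks its token as
# freezable; a flat (high-entropy) one keeps the token active.
rows = [[0.97, 0.01, 0.01, 0.01],   # peaked -> low entropy
        [0.25, 0.25, 0.25, 0.25]]   # flat   -> high entropy (ln 4)
print(freeze_candidates(rows, threshold=1.0))  # -> [0]
```

In practice the entropy would be computed from the model's attention weights in a batched tensor operation, but the decision rule is the same: low entropy flags a freeze candidate.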
Evaluation Highlights
- 55-67% reduction in active KV cache size on LLaMA-3 8B while maintaining generation quality
- Passes needle-in-haystack retrieval tests, demonstrating preserved long-range context retrieval despite cache compression
Breakthrough Assessment
Methodology
- At each generation step, compute attention entropy within a sliding window to score token importance; low-entropy tokens (those receiving sharply concentrated rather than diffuse attention) are flagged as candidates for freezing
- Apply the soft-freeze: suspend active KV cache updates for flagged tokens and offload their KV entries to CPU/off-GPU storage, freeing GPU memory while retaining recoverability
- Use sublinear freeze scheduling to determine freeze duration per token based on repeated low-importance detections; restore frozen tokens on demand via entropy-guided recovery when their importance rises above threshold
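The freeze/restore cycle in steps two and three can be sketched with plain dictionaries standing in for GPU and off-GPU KV storage. The class and method names are illustrative assumptions, not the authors' API; a real implementation would move KV tensors between device and host memory.

```python
# Minimal sketch of the reversible soft-freeze and on-demand
# restoration; dicts stand in for GPU and CPU KV stores.
class SoftFreezeCache:
    def __init__(self):
        self.gpu = {}   # active KV entries: token_idx -> (key, value)
        self.cpu = {}   # frozen KV entries offloaded off-GPU

    def freeze(self, idx):
        """Suspend updates for token `idx`: move its KV entry off-GPU
        instead of deleting it, so nothing is lost."""
        if idx in self.gpu:
            self.cpu[idx] = self.gpu.pop(idx)

    def restore(self, idx):
        """Entropy-guided recovery: bring a frozen entry back on-GPU
        when its importance rises above threshold."""
        if idx in self.cpu:
            self.gpu[idx] = self.cpu.pop(idx)

cache = SoftFreezeCache()
cache.gpu = {0: ("k0", "v0"), 1: ("k1", "v1")}
cache.freeze(0)    # token 0 scored low-entropy -> offload
assert 0 in cache.cpu and 0 not in cache.gpu
cache.restore(0)   # importance rose again -> restore intact
assert cache.gpu == {0: ("k0", "v0"), 1: ("k1", "v1")}
```

The key contrast with eviction is the `restore` path: because frozen entries are retained off-GPU, the freeze decision is fully reversible.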
System Components
- Sliding attention window: a local window over recent tokens used to compute per-token attention entropy scores that proxy importance for freeze decisions
- Entropy-guided importance scorer: measures attention entropy per token to identify low-importance tokens suitable for KV freezing, and triggers recovery when entropy indicates renewed relevance
- Reversible soft-freeze mechanism: suspends KV cache updates for low-importance tokens by offloading their entries to off-GPU (CPU) storage instead of permanently deleting them
- Sublinear freeze scheduler: controls how long a token remains frozen using a sublinear function of repeated low-importance detections, preventing runaway compression of frequently low-scored tokens
- Entropy-guided recovery module: restores frozen KV entries from off-GPU storage back to GPU when the scorer determines a previously frozen token has become important again
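The sublinear freeze duration admits a simple concrete form. The square-root growth and the base step count below are assumed choices for illustration; the source specifies only that duration grows sublinearly with repeated low-importance detections.

```python
import math

def freeze_duration(low_score_count, base_steps=4):
    """Freeze duration grows sublinearly (here: square root, an assumed
    concrete choice) with the number of consecutive low-importance
    detections, so repeatedly low-scored tokens are never frozen for
    runaway stretches."""
    return int(base_steps * math.sqrt(low_score_count))

# Quadrupling the detection count only doubles the freeze window:
print([freeze_duration(n) for n in (1, 4, 16)])  # -> [4, 8, 16]
```

Any concave schedule (square root, logarithm) gives the stated property: memory savings compound with repeated detections, but freeze windows stay bounded enough that a token's importance is periodically re-checked.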
Results
| Metric/Benchmark | Baseline (Full KV Cache) | ASR-KF-EGR | Delta |
|---|---|---|---|
| Active KV Cache Size (LLaMA-3 8B) | 100% (full cache) | 33-45% of full cache | -55% to -67% |
| Needle-in-Haystack Retrieval | Pass | Pass | No degradation |
| Generation Quality | Reference quality | Maintained (qualitative) | No significant loss |
| Fine-tuning Required | N/A | None (training-free) | 0 training cost |
Key Takeaways
- Practitioners can deploy this as a drop-in inference optimization for long-context LLMs without any retraining, making it immediately applicable to memory-constrained production environments
- The off-GPU storage approach trades GPU memory for CPU memory and potential I/O latency — practitioners must profile restoration overhead in latency-sensitive applications before adoption
- The architecture-agnostic design means it can be evaluated on any transformer-based LLM, but the preliminary nature of experiments (single model, limited tasks) warrants caution before treating reported gains as generalizable
Abstract
We present Adaptive Soft Rolling KV Freeze with Entropy-Guided Recovery (ASR-KF-EGR), a training-free inference-time framework for efficient large language model generation. Our method introduces a reversible soft-freeze mechanism that temporarily suspends key-value (KV) updates for low-importance tokens identified within a sliding attention window. Unlike eviction-based approaches that permanently discard context, ASR-KF-EGR preserves all tokens in off-GPU storage and restores them on demand. We extend the framework with sublinear freeze scheduling, where freeze duration grows sublinearly with repeated low-importance detections, preventing over-aggressive compression. Preliminary experiments on LLaMA-3 8B demonstrate 55-67% reduction in active KV cache size while maintaining generation quality and passing needle-in-haystack retrieval tests. The method is architecture-agnostic, requires no fine-tuning, and provides a practical solution for memory-constrained deployment of long-context LLMs.