
Adaptive Soft Rolling KV Freeze with Entropy-Guided Recovery: Sublinear Memory Growth for Efficient LLM Inference

Adilet Metinov, Gulida M. Kudakeeva, Bolotbek uulu Nursultan, G. Kabaeva
arXiv.org | 2025
ASR-KF-EGR is a training-free inference framework that temporarily freezes KV cache updates for low-importance tokens using entropy-guided detection and sublinear scheduling, achieving sublinear memory growth without permanently discarding context.

Problem Statement

Long-context LLM inference is bottlenecked by KV cache memory, which grows linearly with sequence length and exhausts GPU memory. Existing eviction-based approaches (e.g., StreamingLLM, H2O) permanently discard tokens, risking irreversible information loss. A reversible, memory-efficient alternative that preserves full context fidelity is needed for practical deployment.

Key Novelty

  • Reversible soft-freeze mechanism that suspends KV updates for low-importance tokens while preserving them in off-GPU storage for on-demand restoration, unlike permanent eviction methods
  • Entropy-guided importance detection within a sliding attention window to identify which tokens can be safely frozen without quality degradation
  • Sublinear freeze scheduling where freeze duration grows sublinearly with repeated low-importance detections, preventing over-aggressive compression and balancing memory savings with fidelity

Evaluation Highlights

  • 55-67% reduction in active KV cache size on LLaMA-3 8B while maintaining generation quality
  • Passes needle-in-haystack retrieval tests, demonstrating preserved long-range context retrieval despite cache compression

Breakthrough Assessment

4/10. The reversible soft-freeze concept is a meaningful improvement over eviction-based methods, but the evaluation is preliminary (single model, limited benchmarks) and the latency cost of off-GPU storage is not fully addressed. This places the work as a solid early-stage contribution rather than a significant advance.

Methodology

  1. At each generation step, compute attention entropy within a sliding window to score token importance; tokens with low entropy scores are flagged as candidates for freezing
  2. Apply the soft-freeze: suspend active KV cache updates for flagged tokens and offload their KV entries to CPU/off-GPU storage, freeing GPU memory while retaining recoverability
  3. Use sublinear freeze scheduling to determine freeze duration per token based on repeated low-importance detections; restore frozen tokens on demand via entropy-guided recovery when their importance rises above threshold
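
The entropy-based flagging in step 1 can be sketched in a few lines. The Shannon-entropy form and the fixed threshold below are illustrative assumptions; the summary does not give the exact scoring rule:

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one token's attention distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)

def flag_freeze_candidates(window_attn, threshold):
    """Return indices of tokens whose attention entropy falls below
    the threshold, i.e. low-importance freeze candidates."""
    return [i for i, row in enumerate(window_attn)
            if attention_entropy(row) < threshold]
```

A sharply peaked attention row has low entropy and gets flagged, while a near-uniform row stays active; in practice the rows would come from the model's attention maps over the sliding window.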

System Components

Sliding Attention Window

A local window over recent tokens used to compute per-token attention entropy scores that proxy importance for freeze decisions
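
A minimal sketch of the windowing step, assuming the window simply keeps the last W attention weights of a row and renormalizes them into a distribution (the window size and the renormalization are assumptions, not stated in the summary):

```python
def sliding_window(attn_row, window):
    """Keep only the last `window` attention weights and renormalize
    so they again form a probability distribution."""
    tail = attn_row[-window:]
    total = sum(tail)
    return [w / total for w in tail] if total > 0 else tail
```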

Entropy-Guided Importance Scorer

Measures attention entropy per token to identify low-importance tokens suitable for KV freezing, and triggers recovery when entropy indicates renewed relevance

Soft-Freeze Mechanism

Reversibly suspends KV cache updates for low-importance tokens by offloading their entries to off-GPU (CPU) storage instead of permanently deleting them
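
A minimal sketch of the freeze/restore bookkeeping, modeling GPU-resident and off-GPU storage as two Python dicts keyed by token position. A real implementation would move KV tensors between devices (e.g., GPU to pinned CPU memory); this class only illustrates the reversibility:

```python
class SoftFreezeCache:
    """Toy stand-in for a reversible KV cache: entries move between
    an 'active' (on-GPU) store and a 'frozen' (off-GPU) store, and
    are never deleted."""

    def __init__(self):
        self.active = {}   # token position -> KV entry (on-GPU stand-in)
        self.frozen = {}   # token position -> KV entry (off-GPU stand-in)

    def freeze(self, pos):
        """Suspend updates for `pos`: offload its KV entry, keeping it
        recoverable instead of discarding it."""
        if pos in self.active:
            self.frozen[pos] = self.active.pop(pos)

    def restore(self, pos):
        """Bring a frozen entry back on demand."""
        if pos in self.frozen:
            self.active[pos] = self.frozen.pop(pos)
```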

Sublinear Freeze Scheduler

Controls how long a token remains frozen using a sublinear function of repeated low-importance detections, preventing runaway compression of frequently low-scored tokens
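
One plausible schedule uses square-root growth. The sqrt form and the base duration are illustrative assumptions; the summary only states that freeze duration grows sublinearly with repeated low-importance detections:

```python
import math

def freeze_duration(detections, base=4):
    """Steps a token stays frozen after its n-th consecutive
    low-importance detection; grows like sqrt(n) rather than
    linearly, so repeated detections lengthen the freeze only
    gradually."""
    return int(base * math.sqrt(detections))
```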

On-Demand Recovery Module

Restores frozen KV entries from off-GPU storage back to GPU when the entropy-guided scorer determines a previously frozen token has become important again
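
The recovery trigger can be sketched as a threshold test over the current entropy scores of frozen tokens; the dict-based interface and the reuse of the freeze threshold are hypothetical:

```python
def tokens_to_restore(frozen_entropy, threshold):
    """Given current entropy scores for frozen token positions,
    return the positions whose importance has risen back above
    the threshold and should be restored to GPU."""
    return [pos for pos, h in frozen_entropy.items() if h >= threshold]
```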

Results

Metric/Benchmark                  | Baseline (Full KV Cache) | ASR-KF-EGR               | Delta
Active KV Cache Size (LLaMA-3 8B) | 100% (full cache)        | 33-45% of full cache     | -55% to -67%
Needle-in-Haystack Retrieval      | Pass                     | Pass                     | No degradation
Generation Quality                | Reference quality        | Maintained (qualitative) | No significant loss
Fine-tuning Required              | N/A                      | None (training-free)     | 0 training cost

Key Takeaways

  • Practitioners can deploy this as a drop-in inference optimization for long-context LLMs without any retraining, making it immediately applicable to memory-constrained production environments
  • The off-GPU storage approach trades GPU memory for CPU memory and potential I/O latency; practitioners should profile restoration overhead in latency-sensitive applications before adoption
  • The architecture-agnostic design means it can be evaluated on any transformer-based LLM, but the preliminary nature of experiments (single model, limited tasks) warrants caution before treating reported gains as generalizable

Abstract

We present Adaptive Soft Rolling KV Freeze with Entropy-Guided Recovery (ASR-KF-EGR), a training-free inference-time framework for efficient large language model generation. Our method introduces a reversible soft-freeze mechanism that temporarily suspends key-value (KV) updates for low-importance tokens identified within a sliding attention window. Unlike eviction-based approaches that permanently discard context, ASR-KF-EGR preserves all tokens in off-GPU storage and restores them on demand. We extend the framework with sublinear freeze scheduling, where freeze duration grows sublinearly with repeated low-importance detections, preventing over-aggressive compression. Preliminary experiments on LLaMA-3 8B demonstrate 55-67% reduction in active KV cache size while maintaining generation quality and passing needle-in-haystack retrieval tests. The method is architecture-agnostic, requires no fine-tuning, and provides a practical solution for memory-constrained deployment of long-context LLMs.

Generated on 2026-03-03 using Claude