LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference
Problem Statement
KV caches in LLM inference are traditionally confined to GPU memory, which limits reuse across queries and inference engines. As enterprise deployments grow, cumulative KV cache storage far exceeds GPU memory capacity, yet no efficient solution exists for offloading and transferring these caches. This bottleneck causes redundant prefill computation, high latency, and poor throughput in multi-round and document-heavy workloads.
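For a sense of scale, a back-of-envelope calculation (the model figures below are the published Llama-3-8B configuration; the context sizes are illustrative, not numbers from the paper) shows why cumulative KV cache quickly outgrows a single GPU:

```python
# Rough KV cache sizing, assuming a Llama-3-8B-class model served in FP16.
# Each layer stores one K and one V vector per KV head for every token.
num_layers = 32        # decoder layers
num_kv_heads = 8       # grouped-query attention KV heads
head_dim = 128         # dimension per head
bytes_per_elem = 2     # FP16

bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
print(f"{bytes_per_token / 1024:.0f} KiB per token")                       # 128 KiB

context_tokens = 32_768                                                    # one long context
print(f"{bytes_per_token * context_tokens / 2**30:.1f} GiB per context")   # 4.0 GiB
```

At roughly 4 GiB per 32k-token context, about twenty such contexts already fill an 80 GiB GPU before counting model weights, which is why offloading to CPU, disk, and remote tiers becomes necessary.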
Key Novelty
- First open-source KV cache offloading and sharing solution that integrates with major inference engines (vLLM and SGLang) via a modular connector component decoupled from engine internals (a simplified connector sketch follows this list)
- Highly optimized KV cache data movement that uses batched operations and compute/IO pipelining to minimize transfer overhead across GPU, CPU, disk, and network tiers
- First-class control API for flexible cache orchestration enabling both prefix reuse across queries and prefill-decode (PD) disaggregation across engines/GPUs
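As a mental model of the connector idea, the following sketch shows a minimal engine-facing interface; the class and method names (KVConnector, save_kv, load_kv) are ours for illustration and not LMCache's actual API:

```python
# Hypothetical sketch of a modular KV cache connector; names are illustrative,
# not LMCache's real interface.
from abc import ABC, abstractmethod
from typing import Optional, Sequence

import torch


class KVConnector(ABC):
    """Engine-agnostic boundary: the inference engine only talks to this
    interface, so the cache layer can evolve independently of vLLM/SGLang
    internals."""

    @abstractmethod
    def save_kv(self, token_ids: Sequence[int], kv: torch.Tensor) -> None:
        """Persist the KV tensors produced for a token sequence."""

    @abstractmethod
    def load_kv(self, token_ids: Sequence[int]) -> Optional[torch.Tensor]:
        """Return cached KV for the token sequence, or None on a miss."""


class CpuConnector(KVConnector):
    """Toy backend that parks KV tensors in CPU memory, keyed by token ids."""

    def __init__(self) -> None:
        self._store: dict[tuple[int, ...], torch.Tensor] = {}

    def save_kv(self, token_ids, kv):
        self._store[tuple(token_ids)] = kv.detach().to("cpu")

    def load_kv(self, token_ids):
        return self._store.get(tuple(token_ids))
```

In an actual integration, the engine would call save_kv after prefill and try load_kv before prefill, skipping recomputation whenever a usable prefix is found.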
Evaluation Highlights
- Up to 15x improvement in throughput when combining LMCache with vLLM on workloads such as multi-round question answering and document analysis
- Real-world enterprise deployment insights reveal that context truncation, a widely used industry technique, can cut the prefix cache hit ratio roughly in half, and that remote KV cache fetching meaningfully reduces prefill delay
Breakthrough Assessment
Methodology
- Intercept KV cache tensors generated by LLM inference engines (vLLM, SGLang) using a modular KV cache connector that abstracts engine-specific internals, enabling cache extraction without tight coupling
- Store and tier KV caches across GPU memory, CPU memory, local storage, and remote network storage using batched data movement with compute/IO pipelining to minimize latency overhead
- Expose a control API that enables cache orchestration policies such as prefix-based cache lookup for query reuse and cross-engine KV cache transfer for prefill-decode disaggregation (an illustrative lookup sketch follows this list)
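One way to picture the prefix-based lookup (an illustrative sketch under our own assumptions about chunking and hashing, not LMCache's actual scheme) is to split the token stream into fixed-size chunks and key each chunk by a hash of the entire prefix up to that point, so a lookup walks chunk by chunk until the first miss:

```python
# Illustrative prefix-chunk lookup; chunk size and hashing scheme are assumptions.
import hashlib
from typing import Dict, List

CHUNK_TOKENS = 256  # assumed chunk granularity


def chunk_keys(token_ids: List[int]) -> List[str]:
    """Key each full chunk by a hash covering the whole prefix up to it,
    so identical chunks at different positions never collide."""
    keys: List[str] = []
    h = hashlib.sha256()
    full_chunks = len(token_ids) // CHUNK_TOKENS
    for i in range(full_chunks):
        chunk = token_ids[i * CHUNK_TOKENS:(i + 1) * CHUNK_TOKENS]
        h.update(str(chunk).encode())
        keys.append(h.hexdigest())
    return keys


def cached_prefix_length(token_ids: List[int], store: Dict[str, object]) -> int:
    """Return how many leading tokens already have KV present in `store`."""
    hit_tokens = 0
    for i, key in enumerate(chunk_keys(token_ids)):
        if key not in store:
            break
        hit_tokens = (i + 1) * CHUNK_TOKENS
    return hit_tokens
```

Because every key covers the whole prefix before it, dropping or rewriting early tokens, which is exactly what context truncation does, changes all downstream keys; this is consistent with the deployment insight that truncation can halve the prefix cache hit ratio.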
System Components
- KV cache connector: modular adapter that interfaces LMCache with inference engines (vLLM, SGLang), decoupling the caching layer from rapid engine evolution
- Hierarchical storage backend: spans GPU memory, CPU memory, local disk, and remote storage, with policies for eviction and retrieval
- Data movement pipeline: moves KV cache tensors between memory tiers using batching and overlapped compute/IO to reduce transfer latency (a simplified offload sketch follows this list)
- Control API: first-class programmatic interface for orchestrating cache placement, retrieval, and sharing across storage and network layers
- PD disaggregation transfer: mechanism for moving KV caches between prefill and decode engines across GPUs/nodes to support disaggregated inference architectures
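The data movement idea can be sketched in PyTorch terms; the use of a dedicated CUDA stream and pinned host buffers below is our assumption about how batched, overlapped offloading can be implemented, not LMCache's actual code:

```python
# Illustrative GPU-to-CPU KV offload with batching and compute/IO overlap.
import torch


def offload_kv_batched(kv_layers: list[torch.Tensor]) -> list[torch.Tensor]:
    """Copy per-layer KV tensors to pinned CPU buffers on a side stream so the
    main stream can keep computing while the device-to-host copies run."""
    main_stream = torch.cuda.current_stream()
    copy_stream = torch.cuda.Stream()

    # Pinned (page-locked) host buffers make the D2H copies truly asynchronous.
    host_buffers = [
        torch.empty(t.shape, dtype=t.dtype, device="cpu", pin_memory=True)
        for t in kv_layers
    ]

    with torch.cuda.stream(copy_stream):
        # Do not start copying until the kernels that produced the KV finish.
        copy_stream.wait_stream(main_stream)
        # Issuing all layer copies back to back batches the transfers instead
        # of interleaving many small, latency-bound ones.
        for dst, src in zip(host_buffers, kv_layers):
            dst.copy_(src, non_blocking=True)

    # Callers synchronize copy_stream only when they actually need the data,
    # so model compute and cache IO overlap instead of serializing.
    return host_buffers
```

The same overlap pattern applies in the other direction, prefetching cached KV from CPU or remote storage back into GPU memory ahead of when the engine needs it.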
Results
| Metric/Benchmark | Baseline (vLLM alone) | This Paper (LMCache+vLLM) | Delta |
|---|---|---|---|
| Throughput (multi-round QA) | 1x | Up to 15x | Up to 15x |
| Throughput (document analysis) | 1x | Up to 15x | Up to 15x |
| Prefill delay (remote KV fetch) | Full prefill recomputation | Reduced via remote cache hits | Significant reduction |
| Prefix cache hit ratio (deployment insight) | Without context truncation | With context truncation | ~50% lower (truncation hurts caching) |
Key Takeaways
- Context truncation, commonly used in production to manage long contexts, can halve prefix cache hit ratios; teams should weigh this trade-off against caching efficiency when deploying LMCache or similar systems
- KV cache offloading to remote storage provides meaningful prefill latency benefits, making it viable to trade network bandwidth for GPU memory savings in enterprise multi-tenant deployments (a rough estimate of this trade-off follows this list)
- Decoupling the KV cache layer from inference engine internals via a modular connector is architecturally critical for long-term maintainability as vLLM, SGLang, and other engines evolve rapidly
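To make the bandwidth-for-memory trade-off concrete, here is a rough estimate; all figures (model size, KV footprint, link bandwidth, sustained GPU throughput) are our assumptions, not measurements from the paper:

```python
# Back-of-envelope: fetch cached KV over the network vs. recompute the prefill.
# Every constant below is an illustrative assumption, not a result from the paper.
context_tokens = 32_768
kv_bytes_per_token = 128 * 1024                    # ~128 KiB/token, 8B-class model in FP16
kv_bytes = context_tokens * kv_bytes_per_token     # ~4.3e9 bytes for this context

network_bw = 10e9                                  # assume ~10 GB/s effective link
fetch_s = kv_bytes / network_bw
print(f"remote fetch: {fetch_s:.2f} s")            # ~0.43 s

model_params = 8e9                                 # 8B parameters
prefill_flops = 2 * model_params * context_tokens  # ~2 FLOPs per parameter per token
gpu_flops = 3e14                                   # assume ~300 TFLOPS sustained
recompute_s = prefill_flops / gpu_flops
print(f"recompute:    {recompute_s:.2f} s")        # ~1.75 s
```

Under these assumptions the fetch is several times faster than recomputing the prefill; the actual crossover depends on model size, context length, and available bandwidth.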
Abstract
The KV cache has traditionally been stored in GPU memory to accelerate the decoding phase of large language model (LLM) inference. However, it is increasingly necessary to move KV caches outside GPU devices to enable cache reuse across different queries and inference engines. Our real-world usage statistics confirm this trend: over time, the total KV cache stored by users has grown rapidly, far exceeding the capacity of GPU memory. Despite this need, an efficient solution for offloading and transferring KV caches has been lacking. We present LMCACHE, the first and so far the most efficient open-source KV caching solution, which extracts KV caches generated by modern LLM engines (vLLM and SGLang), stores them outside GPU memory, and shares them across engines and queries. LMCACHE supports both cache offloading (prefix reuse across queries) and prefill-decode (PD) disaggregation (cross-engine/GPU cache transfer). LMCACHE's high performance and wide adoption stem from the following contributions: (1) highly optimized KV cache data movement powered by batched data-movement operations and compute/IO pipelining; (2) a modular KV cache connector component that decouples LMCACHE from the rapid evolution of inference engines; (3) a first-class control API for flexible cache orchestration across GPU, CPU, storage, and network layers. Our evaluation shows that combining LMCACHE with vLLM achieves up to a 15x improvement in throughput on workloads such as multi-round question answering and document analysis. Large-scale adoption of LMCACHE in enterprise settings has also yielded valuable insights: for example, fetching KV caches from remote storage unsurprisingly benefits prefill delay, and context truncation, a technique widely applied in industry, can cut the prefix cache hit ratio roughly in half. The source code of LMCACHE is available at: https://github.com/LMCache/LMCache.