LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference
Problem Statement
KV caches in LLM inference are traditionally confined to GPU memory, which limits reuse across queries and inference engines. As enterprise deployments grow, cumulative KV cache storage far exceeds GPU memory capacity, yet no efficient solution exists for offloading and transferring these caches. This bottleneck causes redundant prefill computation, high latency, and poor throughput in multi-round and document-heavy workloads.
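For a sense of scale, a back-of-envelope calculation (the model figures below are the published Llama-3-8B configuration; the context sizes are illustrative, not numbers from the paper) shows why cumulative KV cache quickly outgrows a single GPU:

```python
# Rough KV cache sizing, assuming a Llama-3-8B-class model served in FP16.
# Each layer stores one K and one V vector per KV head for every token.
num_layers = 32        # decoder layers
num_kv_heads = 8       # grouped-query attention KV heads
head_dim = 128         # dimension per head
bytes_per_elem = 2     # FP16

bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
print(f"{bytes_per_token / 1024:.0f} KiB per token")                       # 128 KiB

context_tokens = 32_768                                                    # one long context
print(f"{bytes_per_token * context_tokens / 2**30:.1f} GiB per context")   # 4.0 GiB
```

At roughly 4 GiB per 32k-token context, about twenty such contexts already fill an 80 GiB GPU before counting model weights, which is why offloading to CPU, disk, and remote tiers becomes necessary.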
Key Novelty
- First open-source KV cache offloading and sharing solution that integrates with major inference engines (vLLM and SGLang) via a modular connector component decoupled from engine internals (a simplified connector sketch follows this list)
- Highly optimized KV cache data movement that uses batched operations and compute/IO pipelining to minimize transfer overhead across GPU, CPU, disk, and network tiers
- First-class control API for flexible cache orchestration enabling both prefix reuse across queries and prefill-decode (PD) disaggregation across engines/GPUs
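As a mental model of the connector idea, the following sketch shows a minimal engine-facing interface; the class and method names (KVConnector, save_kv, load_kv) are ours for illustration and not LMCache's actual API:

```python
# Hypothetical sketch of a modular KV cache connector; names are illustrative,
# not LMCache's real interface.
from abc import ABC, abstractmethod
from typing import Optional, Sequence

import torch


class KVConnector(ABC):
    """Engine-agnostic boundary: the inference engine only talks to this
    interface, so the cache layer can evolve independently of vLLM/SGLang
    internals."""

    @abstractmethod
    def save_kv(self, token_ids: Sequence[int], kv: torch.Tensor) -> None:
        """Persist the KV tensors produced for a token sequence."""

    @abstractmethod
    def load_kv(self, token_ids: Sequence[int]) -> Optional[torch.Tensor]:
        """Return cached KV for the token sequence, or None on a miss."""


class CpuConnector(KVConnector):
    """Toy backend that parks KV tensors in CPU memory, keyed by token ids."""

    def __init__(self) -> None:
        self._store: dict[tuple[int, ...], torch.Tensor] = {}

    def save_kv(self, token_ids, kv):
        self._store[tuple(token_ids)] = kv.detach().to("cpu")

    def load_kv(self, token_ids):
        return self._store.get(tuple(token_ids))
```

In an actual integration, the engine would call save_kv after prefill and try load_kv before prefill, skipping recomputation whenever a usable prefix is found.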
Evaluation Highlights
- Up to 15x improvement in throughput when combining LMCache with vLLM on workloads such as multi-round question answering and document analysis
- Real-world enterprise deployment insights reveal that context truncation, a widely used industry technique, can cut the prefix cache hit ratio roughly in half, and that remote KV cache fetching meaningfully reduces prefill delay
Breakthrough Assessment
Methodology
- Intercept KV cache tensors generated by LLM inference engines (vLLM, SGLang) using a modular KV cache connector that abstracts engine-specific internals, enabling cache extraction without tight coupling
- Store and tier KV caches across GPU memory, CPU memory, local storage, and remote network storage using batched data movement with compute/IO pipelining to minimize latency overhead
- Expose a control API that enables cache orchestration policies such as prefix-based cache lookup for query reuse and cross-engine KV cache transfer for prefill-decode disaggregation (an illustrative lookup sketch follows this list)
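One way to picture the prefix-based lookup (an illustrative sketch under our own assumptions about chunking and hashing, not LMCache's actual scheme) is to split the token stream into fixed-size chunks and key each chunk by a hash of the entire prefix up to that point, so a lookup walks chunk by chunk until the first miss:

```python
# Illustrative prefix-chunk lookup; chunk size and hashing scheme are assumptions.
import hashlib
from typing import Dict, List

CHUNK_TOKENS = 256  # assumed chunk granularity


def chunk_keys(token_ids: List[int]) -> List[str]:
    """Key each full chunk by a hash covering the whole prefix up to it,
    so identical chunks at different positions never collide."""
    keys: List[str] = []
    h = hashlib.sha256()
    full_chunks = len(token_ids) // CHUNK_TOKENS
    for i in range(full_chunks):
        chunk = token_ids[i * CHUNK_TOKENS:(i + 1) * CHUNK_TOKENS]
        h.update(str(chunk).encode())
        keys.append(h.hexdigest())
    return keys


def cached_prefix_length(token_ids: List[int], store: Dict[str, object]) -> int:
    """Return how many leading tokens already have KV present in `store`."""
    hit_tokens = 0
    for i, key in enumerate(chunk_keys(token_ids)):
        if key not in store:
            break
        hit_tokens = (i + 1) * CHUNK_TOKENS
    return hit_tokens
```

Because every key covers the whole prefix before it, dropping or rewriting early tokens, which is exactly what context truncation does, changes all downstream keys; this is consistent with the deployment insight that truncation can halve the prefix cache hit ratio.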
System Components
- KV cache connector: modular adapter that interfaces LMCache with inference engines (vLLM, SGLang), decoupling the caching layer from rapid engine evolution
- Hierarchical storage backend: spans GPU memory, CPU memory, local disk, and remote storage, with policies for eviction and retrieval
- Data movement pipeline: moves KV cache tensors between memory tiers using batching and overlapped compute/IO to reduce transfer latency (a simplified offload sketch follows this list)
- Control API: first-class programmatic interface for orchestrating cache placement, retrieval, and sharing across storage and network layers
- PD disaggregation transfer: mechanism for moving KV caches between prefill and decode engines across GPUs/nodes to support disaggregated inference architectures
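The data movement idea can be sketched in PyTorch terms; the use of a dedicated CUDA stream and pinned host buffers below is our assumption about how batched, overlapped offloading can be implemented, not LMCache's actual code:

```python
# Illustrative GPU-to-CPU KV offload with batching and compute/IO overlap.
import torch


def offload_kv_batched(kv_layers: list[torch.Tensor]) -> list[torch.Tensor]:
    """Copy per-layer KV tensors to pinned CPU buffers on a side stream so the
    main stream can keep computing while the device-to-host copies run."""
    main_stream = torch.cuda.current_stream()
    copy_stream = torch.cuda.Stream()

    # Pinned (page-locked) host buffers make the D2H copies truly asynchronous.
    host_buffers = [
        torch.empty(t.shape, dtype=t.dtype, device="cpu", pin_memory=True)
        for t in kv_layers
    ]

    with torch.cuda.stream(copy_stream):
        # Do not start copying until the kernels that produced the KV finish.
        copy_stream.wait_stream(main_stream)
        # Issuing all layer copies back to back batches the transfers instead
        # of interleaving many small, latency-bound ones.
        for dst, src in zip(host_buffers, kv_layers):
            dst.copy_(src, non_blocking=True)

    # Callers synchronize copy_stream only when they actually need the data,
    # so model compute and cache IO overlap instead of serializing.
    return host_buffers
```

The same overlap pattern applies in the other direction, prefetching cached KV from CPU or remote storage back into GPU memory ahead of when the engine needs it.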
Results
| Metric/Benchmark | Baseline (vLLM alone) | This Paper (LMCache+vLLM) | Delta |
|---|---|---|---|
| Throughput (multi-round QA) | 1x | Up to 15x | Up to 15x |
| Throughput (document analysis) | 1x | Up to 15x | Up to 15x |
| Prefill delay (remote KV fetch) | Full prefill recomputation | Reduced via remote cache hits | Significant reduction |
| Prefix cache hit ratio (deployment insight) | Without context truncation | With context truncation | ~50% lower (truncation hurts caching) |
Key Takeaways
- Context truncation, commonly used in production to manage long contexts, can halve prefix cache hit ratios; teams should weigh this trade-off against caching efficiency when deploying LMCache or similar systems
- KV cache offloading to remote storage provides meaningful prefill latency benefits, making it viable to trade network bandwidth for GPU memory savings in enterprise multi-tenant deployments (a rough estimate of this trade-off follows this list)
- Decoupling the KV cache layer from inference engine internals via a modular connector is architecturally critical for long-term maintainability as vLLM, SGLang, and other engines evolve rapidly
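To make the bandwidth-for-memory trade-off concrete, here is a rough estimate; all figures (model size, KV footprint, link bandwidth, sustained GPU throughput) are our assumptions, not measurements from the paper:

```python
# Back-of-envelope: fetch cached KV over the network vs. recompute the prefill.
# Every constant below is an illustrative assumption, not a result from the paper.
context_tokens = 32_768
kv_bytes_per_token = 128 * 1024                    # ~128 KiB/token, 8B-class model in FP16
kv_bytes = context_tokens * kv_bytes_per_token     # ~4.3e9 bytes for this context

network_bw = 10e9                                  # assume ~10 GB/s effective link
fetch_s = kv_bytes / network_bw
print(f"remote fetch: {fetch_s:.2f} s")            # ~0.43 s

model_params = 8e9                                 # 8B parameters
prefill_flops = 2 * model_params * context_tokens  # ~2 FLOPs per parameter per token
gpu_flops = 3e14                                   # assume ~300 TFLOPS sustained
recompute_s = prefill_flops / gpu_flops
print(f"recompute:    {recompute_s:.2f} s")        # ~1.75 s
```

Under these assumptions the fetch is several times faster than recomputing the prefill; the actual crossover depends on model size, context length, and available bandwidth.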
Abstract
The KV cache has traditionally been stored in GPU memory to accelerate the decoding phase of large language model (LLM) inference. However, it is increasingly necessary to move KV caches outside GPU devices to enable cache reuse across different queries and inference engines. Our real-world usage statistics confirm this trend: over time, the total KV cache stored by users has grown rapidly, far exceeding the capacity of GPU memory. Despite this need, an efficient solution for offloading and transferring KV caches has been lacking. We present LMCACHE, the first and so far the most efficient open-source KV caching solution, which extracts KV caches generated by modern LLM engines (vLLM and SGLang), stores them outside GPU memory, and shares them across engines and queries. LMCACHE supports both cache offloading (prefix reuse across queries) and prefill-decode (PD) disaggregation (cross-engine/GPU cache transfer). LMCACHE's high performance and wide adoption stem from the following contributions: (1) highly optimized KV cache data movement powered by batched data-movement operations and compute/IO pipelining; (2) a modular KV cache connector component that decouples LMCACHE from the rapid evolution of inference engines; (3) a first-class control API for flexible cache orchestration across GPU, CPU, storage, and network layers. Our evaluation shows that combining LMCACHE with vLLM achieves up to a 15x improvement in throughput on workloads such as multi-round question answering and document analysis. Large-scale adoption of LMCACHE in enterprise settings has also yielded valuable insights: for example, fetching KV caches from remote storage unsurprisingly benefits prefill delay, and context truncation, a technique widely applied in industry, can cut the prefix cache hit ratio roughly in half. The source code of LMCACHE is available at: https://github.com/LMCache/LMCache.