VecInfer: Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization
Problem Statement
KV cache memory overhead is a critical bottleneck during LLM inference, especially for long-context scenarios. Existing vector quantization methods degrade severely at ultra-low bit-widths (e.g., 2-bit) because outliers in the key cache corrupt codebook utilization, wasting representational capacity. This limits practical deployment of aggressive compression without unacceptable accuracy loss.
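As a back-of-envelope illustration of the bottleneck (model shape taken from Llama-3.1-8B's public configuration: 32 layers, grouped-query attention with 8 KV heads of dimension 128), the fp16 KV cache at the paper's 196k-token sequence length:

```python
# fp16 KV cache size for Llama-3.1-8B at a 196k-token context
layers, kv_heads, head_dim, bytes_fp16 = 32, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * bytes_fp16  # x2 for K and V
ctx = 196_000
total_gib = per_token * ctx / 2**30
print(per_token, f"{total_gib:.1f} GiB")  # 131072 bytes/token, ~23.9 GiB
```

At 2 bits per value the cache shrinks roughly 8× (before codebook overhead), which is what makes long-context, large-batch inference memory-feasible.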
Key Novelty
- Outlier suppression for key cache via combined smooth and Hadamard transformations, enabling effective codebook coverage at ultra-low bit-widths
- A vector quantization framework (VecInfer) designed for aggressive KV cache compression at 2-bit precision while maintaining accuracy
- Optimized CUDA kernel that fuses attention computation with dequantization, minimizing memory access overhead for efficient real-world deployment
Evaluation Highlights
- At 2-bit quantization on Llama-3.1-8B with 196k sequence length, VecInfer achieves up to 2.7× speedup in large-batch self-attention and 8.3× reduction in single-batch end-to-end latency
- VecInfer consistently outperforms existing quantization baselines on both long-context understanding and mathematical reasoning benchmarks, with 2-bit performance comparable to full precision
Methodology
- Apply smooth transformation to the key cache to redistribute activation magnitudes and reduce per-channel outlier variance across the key vectors
- Apply a Hadamard transformation to further decorrelate and uniformly distribute residual outliers, making the data distribution more amenable to vector quantization codebook fitting
- Train a vector quantization codebook on the transformed key/value distributions, then implement a fused CUDA kernel that performs dequantization inline during attention computation to minimize memory bandwidth pressure
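The first two steps can be sketched as follows. This is an illustrative reconstruction, not the paper's exact formulation: the per-channel max-abs scaling rule and function names are assumptions, and the Hadamard matrix is built via the standard Sylvester construction.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal: H @ H.T = I

def suppress_outliers(K, eps=1e-6):
    # Smooth step (illustrative): per-channel rescaling shrinks outlier channels
    s = np.abs(K).max(axis=0) + eps
    # Hadamard step: orthogonal rotation spreads residual outliers across dims
    K_t = (K / s) @ hadamard(K.shape[1])
    return K_t, s

# Toy key cache: 1024 tokens x 128 head dims, with one outlier channel
rng = np.random.default_rng(0)
K = rng.standard_normal((1024, 128))
K[:, 7] *= 50.0
K_t, s = suppress_outliers(K)

# Per-channel dynamic range is far more uniform after the transforms
print(np.abs(K).max(0).std(), np.abs(K_t).max(0).std())
```

Because the Hadamard matrix is orthogonal and the smoothing scales can be folded into the query side, query-key inner products remain recoverable; the codebook is then fit on the transformed keys.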
System Components
- **Smooth transformation:** Rescales key cache activations to suppress large per-channel outliers, reducing the dynamic range that the codebook must cover
- **Hadamard transformation:** Orthogonal transform applied after smoothing to further decorrelate and spread residual outliers uniformly across dimensions, improving codebook utilization
- **Vector quantization codebook:** Learned codebook that maps groups of key/value vectors to low-bit codes; benefits from outlier suppression to achieve accurate reconstruction at 2-bit precision
- **Fused CUDA kernel:** Custom GPU kernel that integrates dequantization of the compressed KV cache with self-attention computation, reducing memory traffic and improving throughput
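To make the codebook component concrete, here is a hedged sketch of 2-bit-per-value vector quantization. The subvector length (4) and codebook size (256, i.e. 8 bits per subvector = 2 bits per value) are illustrative assumptions, not the paper's reported configuration, and plain k-means stands in for whatever codebook training VecInfer uses:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_codebook(X, k=256, iters=10):
    # Plain k-means (Lloyd's algorithm) over subvectors
    C = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - C[None]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            pts = X[assign == j]
            if len(pts):
                C[j] = pts.mean(0)
    return C

def encode(X, C):
    d = ((X[:, None, :] - C[None]) ** 2).sum(-1)
    return d.argmin(1).astype(np.uint8)  # 1 byte per 4-dim subvector = 2 bits/value

# Toy key cache: 512 tokens x 64 dims, split into 4-dim subvectors
K = rng.standard_normal((512, 64)).astype(np.float32)
sub = K.reshape(-1, 4)
C = fit_codebook(sub)
codes = encode(sub, C)

# Dequantization is a table lookup; the fused kernel performs this inline
K_hat = C[codes].reshape(K.shape)
err = np.linalg.norm(K - K_hat) / np.linalg.norm(K)
print(f"relative reconstruction error: {err:.3f}")
```

The point of the fused kernel is that a real implementation would perform the `C[codes]` lookup inside the attention loop, in registers or shared memory, rather than materializing the dequantized `K_hat` in global memory.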
Results
| Metric/Benchmark | Baseline (Full Precision / Prior VQ) | VecInfer (2-bit) | Delta |
|---|---|---|---|
| Self-attention speedup (large-batch, 196k ctx) | 1× | 2.7× | +170% |
| End-to-end latency reduction (single-batch, 196k ctx) | 1× | 8.3× | +730% |
| Long-context understanding accuracy | Baseline VQ degrades significantly at 2-bit | Comparable to full precision | Closes accuracy gap |
| Mathematical reasoning accuracy | Baseline VQ degrades significantly at 2-bit | Comparable to full precision | Closes accuracy gap |
Key Takeaways
- Outlier suppression (smooth + Hadamard transforms) is a critical preprocessing step before applying vector quantization to KV caches — without it, 2-bit compression causes severe accuracy degradation that limits practical use
- 2-bit KV cache quantization with VecInfer is viable for production long-context inference (up to 196k tokens), offering up to an 8.3× latency reduction that can dramatically reduce serving costs for memory-bound workloads
- Fusing dequantization with attention computation in custom CUDA kernels is essential to realize the theoretical memory savings of KV cache compression as actual wall-clock speedups on real hardware
Abstract
The Key-Value (KV) cache introduces substantial memory overhead during large language model (LLM) inference. Although existing vector quantization (VQ) methods reduce KV cache usage and provide flexible representational capacity across bit-widths, they suffer severe performance degradation at ultra-low bit-widths due to key cache outliers that hinder effective codebook utilization. To address this challenge, we propose VecInfer, a novel VQ method for aggressive KV cache compression while enabling efficient inference. By applying smooth and Hadamard transformations, VecInfer suppresses outliers in the key cache, enabling the codebook to comprehensively cover the original data distribution and thereby reducing quantization difficulty. To facilitate efficient deployment, we design an optimized CUDA kernel that fuses computation with dequantization to minimize memory access overhead. Extensive evaluations demonstrate that VecInfer consistently outperforms existing quantization baselines across both long-context understanding and mathematical reasoning tasks. With only 2-bit quantization, VecInfer achieves performance comparable to full precision, while delivering up to $\mathbf{2.7\times}$ speedup in large-batch self-attention computation and $\mathbf{8.3\times}$ reduction in single-batch end-to-end latency on Llama-3.1-8B with a 196k sequence length.