
VecInfer: Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization

Dingyu Yao, Chenxu Yang, Zhengyang Tong, Zheng Lin, Wei Liu, Jian Luan, Weiping Wang
arXiv.org | 2025
VecInfer is a vector quantization method for KV cache compression in LLMs that suppresses key cache outliers via smooth and Hadamard transformations, enabling aggressive 2-bit quantization with near-full-precision performance and significant inference speedups.

Problem Statement

KV cache memory overhead is a critical bottleneck during LLM inference, especially for long-context scenarios. Existing vector quantization methods degrade severely at ultra-low bit-widths (e.g., 2-bit) because outliers in the key cache corrupt codebook utilization, wasting representational capacity. This limits practical deployment of aggressive compression without unacceptable accuracy loss.

Key Novelty

  • Outlier suppression for key cache via combined smooth and Hadamard transformations, enabling effective codebook coverage at ultra-low bit-widths
  • A novel vector quantization framework (VecInfer) designed for aggressive KV cache compression at 2-bit precision while maintaining accuracy
  • Optimized CUDA kernel that fuses attention computation with dequantization, minimizing memory access overhead for efficient real-world deployment

Evaluation Highlights

  • At 2-bit quantization on Llama-3.1-8B with 196k sequence length, VecInfer achieves up to 2.7× speedup in large-batch self-attention and 8.3× reduction in single-batch end-to-end latency
  • VecInfer consistently outperforms existing quantization baselines on both long-context understanding and mathematical reasoning benchmarks, with 2-bit performance comparable to full precision

Breakthrough Assessment

7/10: VecInfer makes a significant practical advance by solving a well-known failure mode of vector quantization at ultra-low bit-widths, achieving near-lossless 2-bit KV cache compression with hardware-efficient kernels, directly enabling long-context LLM deployment on memory-constrained hardware.

Methodology

  1. Apply smooth transformation to the key cache to redistribute activation magnitudes and reduce per-channel outlier variance across the key vectors
  2. Apply a Hadamard transformation to further decorrelate and uniformly distribute residual outliers, making the data distribution more amenable to vector quantization codebook fitting
  3. Train a vector quantization codebook on the transformed key/value distributions, then implement a fused CUDA kernel that performs dequantization inline during attention computation to minimize memory bandwidth pressure
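The three-step pipeline above can be sketched end-to-end in numpy. This is a minimal toy reconstruction, not the paper's implementation: the cache size, sub-vector length, codebook size, and the max-magnitude smoothing rule are all illustrative assumptions (a 256-entry codebook over 4-dim sub-vectors gives 8 bits per code, i.e. 2 bits per dimension).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy key cache (num_tokens x head_dim); sizes are illustrative.
K = rng.normal(size=(512, 64))
K[:, 3] *= 20.0  # inject a per-channel outlier, mimicking real key caches

# Step 1: smooth transformation -- rescale each channel by its max
# magnitude so no single channel dominates the dynamic range.
scale = np.abs(K).max(axis=0) + 1e-6
K_smooth = K / scale

# Step 2: Hadamard transformation -- an orthogonal rotation that spreads
# residual outlier energy evenly across dimensions.
def hadamard(n: int) -> np.ndarray:
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal: H @ H.T == I

H = hadamard(64)
K_rot = K_smooth @ H

# Step 3: vector quantization -- split each key into 4-dim sub-vectors
# and map each to its nearest entry of a k-means codebook.
def kmeans(X, k, iters=8):
    C = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = ((X[:, None, :] - C[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            members = X[assign == j]
            if len(members):
                C[j] = members.mean(0)
    return C

sub = 4
X = K_rot.reshape(-1, sub)
codebook = kmeans(X, k=256)
codes = ((X[:, None, :] - codebook[None]) ** 2).sum(-1).argmin(1).astype(np.uint8)

# Dequantize and invert both transforms to recover approximate keys.
K_hat = (codebook[codes].reshape(K.shape) @ H.T) * scale
rel_err = np.linalg.norm(K - K_hat) / np.linalg.norm(K)
```

Because both transforms are invertible (the Hadamard matrix is orthogonal and the smoothing scales are stored), only the quantizer itself introduces error.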

System Components

Smooth Transformation

Rescales key cache activations to suppress large per-channel outliers, reducing the dynamic range that the codebook must cover
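A useful property of this kind of per-channel rescaling (used by smoothing-style methods generally; shapes and the max-magnitude scale rule here are illustrative assumptions) is that the scales divided out of the keys can be folded into the queries, so attention scores are mathematically unchanged:

```python
import numpy as np

rng = np.random.default_rng(1)
Q = rng.normal(size=(8, 64))    # toy queries
K = rng.normal(size=(256, 64))  # toy key cache
K[:, 10] *= 30.0                # a large per-channel outlier

# Per-channel smoothing scale taken from the key cache's own magnitudes.
scale = np.abs(K).max(axis=0)
K_s = K / scale                 # every channel now lies in [-1, 1]
Q_s = Q * scale                 # fold the scale into the queries

# (Q * s) @ (K / s)^T == Q @ K^T, so scores are exactly preserved
# while the quantizer sees a much smaller dynamic range.
```

The quantizer therefore only ever sees the bounded `K_s`, at no cost to attention accuracy.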

Hadamard Transformation

Orthogonal transform applied after smoothing to further decorrelate and spread residual outliers uniformly across dimensions, improving codebook utilization
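The outlier-spreading effect is easy to see on a single spike. The sketch below uses a standard O(n log n) fast Walsh-Hadamard transform (a generic implementation, not the paper's kernel): a one-hot "outlier" vector comes out with its energy distributed perfectly evenly, and the orthogonal transform preserves the L2 norm.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform, O(n log n); n must be a power of 2."""
    x = x.copy()
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

n = 128
e = np.zeros(n)
e[5] = 1.0                    # a single-channel "outlier" spike
y = fwht(e) / np.sqrt(n)      # orthonormal scaling

# Every |y_i| equals 1/sqrt(n): the spike's energy is spread evenly
# across all dimensions, and the L2 norm is preserved.
```

Applying the unnormalized transform twice returns n times the input, which is why the rotation can be undone exactly at dequantization time.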

Vector Quantization Codebook

Learned codebook that maps groups of key/value vectors to low-bit codes; benefits from outlier suppression to achieve accurate reconstruction at 2-bit precision
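The storage arithmetic behind "2-bit" vector quantization can be sketched as follows; the sub-vector length (4) and codebook size (256 entries, so one uint8 code per sub-vector) are assumed illustrative values, not necessarily the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(2)

head_dim, sub, entries = 64, 4, 256       # 8-bit code per 4 dims = 2 bits/dim
codebook = rng.normal(size=(entries, sub))  # stand-in for a learned codebook

def encode(k):
    """Map each sub-vector of a key to its nearest codebook index."""
    subs = k.reshape(-1, sub)
    d = ((subs[:, None, :] - codebook[None]) ** 2).sum(-1)
    return d.argmin(1).astype(np.uint8)

def decode(codes):
    """Reconstruct an approximate key from its codes."""
    return codebook[codes].reshape(-1)

k = rng.normal(size=head_dim)
codes = encode(k)
k_hat = decode(codes)

bits_fp16 = head_dim * 16      # 1024 bits per key vector in fp16
bits_vq = codes.size * 8       # 16 codes x 8 bits = 128 bits
# -> 8x smaller than fp16; the codebook itself amortizes over the cache
```

This is why outlier suppression matters: with only 256 representable sub-vectors, a few extreme values would otherwise pull centroids away from the bulk of the distribution.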

Fused CUDA Kernel

Custom GPU kernel that integrates dequantization of the compressed KV cache with self-attention computation, reducing memory traffic and improving throughput
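The fusion idea can be illustrated in numpy with a tiled, online-softmax attention loop that dequantizes each key tile inline, so full-precision keys are never materialized in memory. This is a simplified model of the kernel's dataflow, not the CUDA implementation: sizes are toy, values are kept in full precision, and the tiny 16-entry codebook is an assumption for brevity.

```python
import numpy as np

rng = np.random.default_rng(3)
T, d, sub = 256, 8, 4                  # toy sequence length / head dim
codebook = rng.normal(size=(16, sub))  # tiny stand-in codebook

# Quantize a toy key cache offline (nearest-codeword assignment).
K = rng.normal(size=(T, d))
subs = K.reshape(-1, sub)
codes = ((subs[:, None, :] - codebook[None]) ** 2).sum(-1).argmin(1)
codes = codes.reshape(T, d // sub)
V = rng.normal(size=(T, d))
q = rng.normal(size=d)

# "Fused" loop: dequantize one 64-row KV tile at a time and fold it into
# an online-softmax accumulator (the pattern a fused kernel uses so the
# dequantized keys live only in registers/shared memory).
m, l, acc = -np.inf, 0.0, np.zeros(d)
for start in range(0, T, 64):
    Kb = codebook[codes[start:start + 64]].reshape(-1, d)  # inline dequant
    s = Kb @ q / np.sqrt(d)
    m_new = max(m, s.max())
    w = np.exp(s - m_new)
    corr = np.exp(m - m_new)
    l = l * corr + w.sum()
    acc = acc * corr + w @ V[start:start + 64]
    m = m_new
out = acc / l

# Reference: standard attention over the fully dequantized keys.
logits = q @ codebook[codes].reshape(T, d).T / np.sqrt(d)
p = np.exp(logits - logits.max())
ref = (p / p.sum()) @ V
```

The tiled result matches the reference exactly up to floating-point error; the payoff on real hardware is that memory traffic scales with the 2-bit codes rather than the full-precision cache.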

Results

| Metric/Benchmark | Baseline (full precision / prior VQ) | VecInfer (2-bit) | Delta |
|---|---|---|---|
| Self-attention throughput (large-batch, 196k ctx) | 1× (full precision) | 2.7× speedup | +170% throughput |
| End-to-end latency (single-batch, 196k ctx) | 1× (full precision) | 8.3× faster | ≈88% lower latency |
| Long-context understanding accuracy | prior VQ degrades severely at 2-bit | comparable to full precision | closes the accuracy gap |
| Mathematical reasoning accuracy | prior VQ degrades severely at 2-bit | comparable to full precision | closes the accuracy gap |

Key Takeaways

  • Outlier suppression (smooth + Hadamard transforms) is a critical preprocessing step before applying vector quantization to KV caches — without it, 2-bit compression causes severe accuracy degradation that limits practical use
  • 2-bit KV cache quantization with VecInfer is viable for production long-context inference (up to 196k tokens), offering an 8× latency reduction that can dramatically reduce serving costs for memory-bound workloads
  • Fusing dequantization with attention computation in custom CUDA kernels is essential to realize the theoretical memory savings of KV cache compression as actual wall-clock speedups on real hardware
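A back-of-envelope calculation shows why 2-bit compression matters at this scale. Using the publicly documented Llama-3.1-8B attention configuration (32 layers, 8 KV heads via grouped-query attention, head dimension 128; these config values are an assumption drawn from the model card, not from the paper):

```python
# KV cache sizing sketch for a Llama-3.1-8B-like config at 196k context.
layers, kv_heads, head_dim = 32, 8, 128
ctx = 196_000

# K and V, 2 bytes per fp16 element.
bytes_per_token_fp16 = 2 * layers * kv_heads * head_dim * 2
fp16_gib = ctx * bytes_per_token_fp16 / 2**30
vq2_gib = fp16_gib / 8   # 16-bit -> 2-bit is an 8x reduction
                         # (ignoring small codebook/metadata overhead)
print(f"fp16 KV cache: {fp16_gib:.1f} GiB, 2-bit VQ: {vq2_gib:.1f} GiB")
```

Roughly 24 GiB of fp16 KV cache shrinks to about 3 GiB, which is the difference between the cache dominating a GPU's memory and it fitting comfortably alongside the weights.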

Abstract

The Key-Value (KV) cache introduces substantial memory overhead during large language model (LLM) inference. Although existing vector quantization (VQ) methods reduce KV cache usage and provide flexible representational capacity across bit-widths, they suffer severe performance degradation at ultra-low bit-widths due to key cache outliers that hinder effective codebook utilization. To address this challenge, we propose VecInfer, a novel VQ method for aggressive KV cache compression while enabling efficient inference. By applying smooth and Hadamard transformations, VecInfer suppresses outliers in the key cache, enabling the codebook to comprehensively cover the original data distribution and thereby reducing quantization difficulty. To facilitate efficient deployment, we design an optimized CUDA kernel that fuses computation with dequantization to minimize memory access overhead. Extensive evaluations demonstrate that VecInfer consistently outperforms existing quantization baselines across both long-context understanding and mathematical reasoning tasks. With only 2-bit quantization, VecInfer achieves performance comparable to full precision, while delivering up to $\mathbf{2.7\times}$ speedup in large-batch self-attention computation and $\mathbf{8.3\times}$ reduction in single-batch end-to-end latency on Llama-3.1-8B with a 196k sequence length.

Generated on 2026-04-01 using Claude