Taming the Titans: A Survey of Efficient LLM Inference Serving
Problem Statement
LLMs impose massive memory overhead due to their vast parameter counts and incur high computational cost from the attention mechanism, making low-latency, high-throughput serving extremely challenging. Existing literature lacks a unified, structured overview that bridges instance-level scheduling and storage optimizations with cluster-level orchestration and emerging use-case-specific techniques. Practitioners need consolidated guidance to navigate the rapidly expanding solution landscape for production LLM deployment.
Key Novelty
- Introduces a multi-level taxonomy covering instance-level (model placement, request scheduling, decoding length prediction, KV cache management, prefill-decode disaggregation) and cluster-level (GPU cluster deployment, multi-instance load balancing, cloud solutions) optimization strategies in a single unified framework
- Systematically organizes emerging scenario directions around specific tasks, modules, and auxiliary methods—capturing frontier topics like long-context serving, multi-modal inference, speculative decoding, and retrieval-augmented generation within the efficiency lens
- Identifies and highlights niche but critical areas often overlooked in prior surveys (e.g., disaggregation paradigms, decoding length prediction) and outlines concrete future research directions for the community
Evaluation Highlights
- Qualitative coverage: surveys methods across the full inference stack from single-instance kernel-level optimizations to multi-datacenter cloud orchestration, providing broader scope than prior surveys focused on model compression or single-system optimizations
- Structural completeness: organizes 10+ distinct sub-areas (model placement, request scheduling, storage management, load balancing, cloud services, task-specific and module-specific emerging methods) into a coherent reference framework for ML practitioners
Methodology
- Decompose the LLM inference serving problem into hierarchical levels (instance vs. cluster) and identify canonical sub-problems within each level, then survey state-of-the-art solutions per sub-problem
- Extend coverage to emerging scenarios by organizing recent work along three axes—task-specific optimizations (e.g., long-context, RAG, multi-modal), module-specific techniques (e.g., attention kernels, KV cache), and auxiliary methods (e.g., speculative decoding)—to capture frontier research
- Synthesize findings into a holistic overview by highlighting underexplored but critical niche areas and distilling open research questions and future directions to guide subsequent work
System Components
Covers model placement strategies, request scheduling policies, decoding length prediction, KV cache and memory storage management, and prefill-decode disaggregation to maximize single-instance throughput and minimize latency
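As a concrete illustration of the KV cache management ideas surveyed here, the following is a minimal, self-contained sketch of block-based (PagedAttention-style) KV cache allocation. All class and parameter names are illustrative inventions, not taken from any specific serving system; real implementations manage GPU memory, not Python lists.

```python
class PagedKVCache:
    """Toy block-based KV cache allocator (PagedAttention-style sketch)."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size               # tokens stored per block
        self.free_blocks = list(range(num_blocks)) # pool of free block ids
        self.block_tables = {}                     # request id -> block ids

    def blocks_needed(self, num_tokens: int) -> int:
        return -(-num_tokens // self.block_size)   # ceiling division

    def allocate(self, req_id: str, num_tokens: int) -> bool:
        """Reserve enough blocks for a prompt; False if memory is exhausted."""
        need = self.blocks_needed(num_tokens)
        if need > len(self.free_blocks):
            return False                           # caller must queue or preempt
        self.block_tables[req_id] = [self.free_blocks.pop() for _ in range(need)]
        return True

    def append_token(self, req_id: str, seq_len: int) -> bool:
        """Grow a sequence by one token, grabbing a new block on a boundary."""
        if seq_len % self.block_size == 0:         # current blocks are full
            if not self.free_blocks:
                return False
            self.block_tables[req_id].append(self.free_blocks.pop())
        return True

    def free(self, req_id: str) -> None:
        """Return a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(req_id, []))

cache = PagedKVCache(num_blocks=8, block_size=16)
assert cache.allocate("req-1", num_tokens=40)      # 40 tokens -> 3 blocks
assert len(cache.free_blocks) == 5
cache.free("req-1")
assert len(cache.free_blocks) == 8
```

The key property this captures is that memory is reserved in fixed-size blocks on demand rather than pre-allocated for a worst-case sequence length, which is what makes high batch occupancy possible.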
Addresses GPU cluster deployment configurations, multi-instance load balancing algorithms, and cloud-native service solutions to scale LLM serving across distributed infrastructure
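To make the multi-instance load balancing idea concrete, here is a deliberately simple least-loaded router: each request goes to the replica with the fewest in-flight tokens, a rough proxy for KV cache pressure. The class and field names are illustrative, and production balancers track richer signals (queue depth, cache occupancy, prefix locality).

```python
from dataclasses import dataclass

@dataclass
class Replica:
    """One serving instance with a crude load counter."""
    name: str
    inflight_tokens: int = 0

class LeastLoadedRouter:
    """Route each request to the replica with the least in-flight work."""

    def __init__(self, replicas):
        self.replicas = replicas

    def route(self, prompt_tokens: int) -> Replica:
        target = min(self.replicas, key=lambda r: r.inflight_tokens)
        target.inflight_tokens += prompt_tokens
        return target

    def complete(self, replica: Replica, prompt_tokens: int) -> None:
        replica.inflight_tokens -= prompt_tokens

router = LeastLoadedRouter([Replica("gpu-0"), Replica("gpu-1")])
first = router.route(100)   # ties broken by list order -> gpu-0
second = router.route(20)   # gpu-1 is now the least loaded
assert {first.name, second.name} == {"gpu-0", "gpu-1"}
```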
Organizes task-specific (long-context, RAG, multi-modal), module-specific (attention, KV cache), and auxiliary methods (speculative decoding, quantization-aware serving) into a structured discussion of frontier challenges
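Among the auxiliary methods, speculative decoding can be sketched in a few lines: a cheap draft model proposes k tokens and the target model keeps the longest agreeing prefix plus one corrected token. Both "models" below are stand-in callables and this shows only the greedy accept/reject variant, not the full probabilistic rejection-sampling scheme.

```python
def speculative_step(draft_next, target_next, context, k=4):
    """Return the tokens accepted in one speculation round (greedy variant)."""
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposal, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2. Target model verifies; in practice this is one batched forward pass.
    accepted, ctx = [], list(context)
    for tok in proposal:
        expected = target_next(ctx)
        if tok == expected:
            accepted.append(tok)       # draft agreed: accept for free
            ctx.append(tok)
        else:
            accepted.append(expected)  # take the target's correction, stop
            break
    else:
        accepted.append(target_next(ctx))  # bonus token when all k agree
    return accepted

# Stand-in models: the target repeats the last token; the draft agrees
# except at every third context length.
target = lambda ctx: ctx[-1]
draft = lambda ctx: ctx[-1] if len(ctx) % 3 else 0
assert speculative_step(draft, target, context=[7], k=4) == [7, 7, 7]
```

Note the invariant that makes this a pure latency optimization: with a greedy target, the accepted output is exactly what the target alone would have produced, only reached in fewer target invocations.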
Examines the architectural separation of prefill and decode phases across different hardware or instances to alleviate resource contention and improve overall system efficiency
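The disaggregation pattern can be sketched schematically: prefill workers build the KV cache for a prompt, then hand the request across a transfer boundary to a separate decode pool. The queue layout and names below are illustrative, and the "KV cache" is a placeholder list rather than real attention state.

```python
from collections import deque

class DisaggregatedServer:
    """Schematic prefill/decode disaggregation with two worker queues."""

    def __init__(self):
        self.prefill_queue = deque()  # compute-bound work (full prompts)
        self.decode_queue = deque()   # memory-bandwidth-bound work

    def submit(self, req_id: str, prompt: list) -> None:
        self.prefill_queue.append((req_id, prompt))

    def prefill_step(self) -> None:
        """Run on a prefill instance: process one prompt, ship KV to decoders."""
        if not self.prefill_queue:
            return
        req_id, prompt = self.prefill_queue.popleft()
        kv_cache = [tok * 2 for tok in prompt]        # stand-in for real KV state
        self.decode_queue.append((req_id, kv_cache))  # the KV transfer boundary

    def decode_step(self) -> list:
        """Run on a decode instance: advance every resident request one token."""
        return [req_id for req_id, _ in self.decode_queue]

server = DisaggregatedServer()
server.submit("a", [1, 2, 3])
server.prefill_step()
assert server.decode_step() == ["a"]
```

The point of the separation is visible even in this toy: prefill bursts never stall the decode loop, because the two queues can be drained by differently provisioned hardware.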
Reviews methods that forecast output sequence length to enable smarter scheduling, batching, and resource pre-allocation, reducing head-of-line blocking and wasted compute
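A minimal sketch of how a length prediction plugs into scheduling: admit requests shortest-predicted-first under a token budget, so short jobs are not stuck behind long ones. The predictor here is a crude stand-in (real systems train a classifier or regressor on prompts), and the budget and names are illustrative.

```python
import heapq

class LengthAwareScheduler:
    """Shortest-predicted-job-first batching under a token budget."""

    def __init__(self, predictor, token_budget: int):
        self.predictor = predictor        # prompt -> predicted output length
        self.token_budget = token_budget  # per-batch compute/KV budget
        self.queue = []                   # min-heap keyed by predicted length

    def submit(self, req_id: str, prompt: str) -> None:
        heapq.heappush(self.queue, (self.predictor(prompt), req_id))

    def next_batch(self) -> list:
        """Fill one batch shortest-first without exceeding the token budget."""
        batch, used = [], 0
        while self.queue and used + self.queue[0][0] <= self.token_budget:
            length, req_id = heapq.heappop(self.queue)
            used += length
            batch.append(req_id)
        return batch

# Stand-in predictor: guess output length from prompt word count.
sched = LengthAwareScheduler(predictor=lambda p: 4 * len(p.split()),
                             token_budget=40)
sched.submit("long", "please write a very detailed multi paragraph essay")
sched.submit("short", "hi there")
assert sched.next_batch() == ["short", "long"]
```

Mispredictions matter: underestimating a length overfills the budget at runtime, so practical schedulers pair the predictor with preemption or conservative padding.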
Results
| Coverage Dimension | Prior Surveys | This Survey | Delta |
|---|---|---|---|
| Instance-level scheduling & memory | Partial (model compression focus) | Comprehensive (5 sub-areas) | Broader scope |
| Cluster-level orchestration | Minimal | Full section (3 sub-areas) | New coverage |
| Emerging scenarios (multimodal, RAG, long-ctx) | Ad hoc or absent | Structured taxonomy | Systematic organization |
| Disaggregation & decode length prediction | Not covered | Dedicated treatment | Novel inclusion |
Key Takeaways
- Practitioners deploying LLMs at scale should consider prefill-decode disaggregation as an architectural pattern—separating compute-bound prefill from memory-bound decode phases can significantly improve hardware utilization and SLA compliance
- Request scheduling and decoding length prediction are underinvested levers in most production systems; accurately forecasting output lengths enables smarter batching strategies that reduce tail latency without sacrificing throughput
- Cluster-level load balancing and cloud-native autoscaling are as critical as model-level optimizations—even highly optimized single-instance serving can become a bottleneck if multi-instance orchestration is naive, making holistic system co-design essential
Abstract
Large Language Models (LLMs) for Generative AI have achieved remarkable progress, evolving into sophisticated and versatile tools widely adopted across various domains and applications. However, the substantial memory overhead caused by their vast number of parameters, combined with the high computational demands of the attention mechanism, poses significant challenges in achieving low latency and high throughput for LLM inference services. Recent research breakthroughs have greatly accelerated progress in this field. This paper provides a comprehensive survey of these methods, covering fundamental instance-level approaches, in-depth cluster-level strategies, emerging scenario directions, and other miscellaneous but important areas. At the instance level, we review model placement, request scheduling, decoding length prediction, storage management, and the disaggregation paradigm. At the cluster level, we explore GPU cluster deployment, multi-instance load balancing, and cloud service solutions. For emerging scenarios, we organize the discussion around specific tasks, modules, and auxiliary methods. To ensure a holistic overview, we also highlight several niche yet critical areas. Finally, we outline potential research directions to further advance the field of LLM inference serving.