Taming the Titans: A Survey of Efficient LLM Inference Serving
Problem Statement
LLMs impose massive memory overhead due to their vast parameter counts and incur high computational cost from the attention mechanism, making low-latency, high-throughput serving extremely challenging. Existing literature lacks a unified, structured overview that bridges instance-level scheduling and storage optimizations with cluster-level orchestration and emerging use-case-specific techniques. Practitioners need consolidated guidance to navigate the rapidly expanding solution landscape for production LLM deployment.
Key Novelty
- Introduces a multi-level taxonomy covering instance-level (model placement, request scheduling, decoding length prediction, KV cache management, prefill-decode disaggregation) and cluster-level (GPU cluster deployment, multi-instance load balancing, cloud solutions) optimization strategies in a single unified framework
- Systematically organizes emerging scenario directions around specific tasks, modules, and auxiliary methods—capturing frontier topics like long-context serving, multi-modal inference, speculative decoding, and retrieval-augmented generation within the efficiency lens
- Identifies and highlights niche but critical areas often overlooked in prior surveys (e.g., disaggregation paradigms, decoding length prediction) and outlines concrete future research directions for the community
Evaluation Highlights
- Qualitative coverage: surveys methods across the full inference stack from single-instance kernel-level optimizations to multi-datacenter cloud orchestration, providing broader scope than prior surveys focused on model compression or single-system optimizations
- Structural completeness: organizes 10+ distinct sub-areas (model placement, request scheduling, storage management, load balancing, cloud services, task-specific and module-specific emerging methods) into a coherent reference framework for ML practitioners
Methodology
- Decompose the LLM inference serving problem into hierarchical levels (instance vs. cluster) and identify canonical sub-problems within each level, then survey state-of-the-art solutions per sub-problem
- Extend coverage to emerging scenarios by organizing recent work along three axes—task-specific optimizations (e.g., long-context, RAG, multi-modal), module-specific techniques (e.g., attention kernels, KV cache), and auxiliary methods (e.g., speculative decoding)—to capture frontier research
- Synthesize findings into a holistic overview by highlighting underexplored but critical niche areas and distilling open research questions and future directions to guide subsequent work
System Components
Covers model placement strategies, request scheduling policies, decoding length prediction, KV cache and memory storage management, and prefill-decode disaggregation to maximize single-instance throughput and minimize latency
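As a concrete illustration of the KV cache management ideas surveyed here, the following is a minimal, self-contained sketch of block-based (PagedAttention-style) KV cache allocation. All class and parameter names are illustrative inventions, not taken from any specific serving system; real implementations manage GPU memory, not Python lists.

```python
class PagedKVCache:
    """Toy block-based KV cache allocator (PagedAttention-style sketch)."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size               # tokens stored per block
        self.free_blocks = list(range(num_blocks)) # pool of free block ids
        self.block_tables = {}                     # request id -> block ids

    def blocks_needed(self, num_tokens: int) -> int:
        return -(-num_tokens // self.block_size)   # ceiling division

    def allocate(self, req_id: str, num_tokens: int) -> bool:
        """Reserve enough blocks for a prompt; False if memory is exhausted."""
        need = self.blocks_needed(num_tokens)
        if need > len(self.free_blocks):
            return False                           # caller must queue or preempt
        self.block_tables[req_id] = [self.free_blocks.pop() for _ in range(need)]
        return True

    def append_token(self, req_id: str, seq_len: int) -> bool:
        """Grow a sequence by one token, grabbing a new block on a boundary."""
        if seq_len % self.block_size == 0:         # current blocks are full
            if not self.free_blocks:
                return False
            self.block_tables[req_id].append(self.free_blocks.pop())
        return True

    def free(self, req_id: str) -> None:
        """Return a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(req_id, []))

cache = PagedKVCache(num_blocks=8, block_size=16)
assert cache.allocate("req-1", num_tokens=40)      # 40 tokens -> 3 blocks
assert len(cache.free_blocks) == 5
cache.free("req-1")
assert len(cache.free_blocks) == 8
```

The key property this captures is that memory is reserved in fixed-size blocks on demand rather than pre-allocated for a worst-case sequence length, which is what makes high batch occupancy possible.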
Addresses GPU cluster deployment configurations, multi-instance load balancing algorithms, and cloud-native service solutions to scale LLM serving across distributed infrastructure
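To make the multi-instance load balancing idea concrete, here is a deliberately simple least-loaded router: each request goes to the replica with the fewest in-flight tokens, a rough proxy for KV cache pressure. The class and field names are illustrative, and production balancers track richer signals (queue depth, cache occupancy, prefix locality).

```python
from dataclasses import dataclass

@dataclass
class Replica:
    """One serving instance with a crude load counter."""
    name: str
    inflight_tokens: int = 0

class LeastLoadedRouter:
    """Route each request to the replica with the least in-flight work."""

    def __init__(self, replicas):
        self.replicas = replicas

    def route(self, prompt_tokens: int) -> Replica:
        target = min(self.replicas, key=lambda r: r.inflight_tokens)
        target.inflight_tokens += prompt_tokens
        return target

    def complete(self, replica: Replica, prompt_tokens: int) -> None:
        replica.inflight_tokens -= prompt_tokens

router = LeastLoadedRouter([Replica("gpu-0"), Replica("gpu-1")])
first = router.route(100)   # ties broken by list order -> gpu-0
second = router.route(20)   # gpu-1 is now the least loaded
assert {first.name, second.name} == {"gpu-0", "gpu-1"}
```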
Organizes task-specific (long-context, RAG, multi-modal), module-specific (attention, KV cache), and auxiliary methods (speculative decoding, quantization-aware serving) into a structured discussion of frontier challenges
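Among the auxiliary methods, speculative decoding can be sketched in a few lines: a cheap draft model proposes k tokens and the target model keeps the longest agreeing prefix plus one corrected token. Both "models" below are stand-in callables and this shows only the greedy accept/reject variant, not the full probabilistic rejection-sampling scheme.

```python
def speculative_step(draft_next, target_next, context, k=4):
    """Return the tokens accepted in one speculation round (greedy variant)."""
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposal, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2. Target model verifies; in practice this is one batched forward pass.
    accepted, ctx = [], list(context)
    for tok in proposal:
        expected = target_next(ctx)
        if tok == expected:
            accepted.append(tok)       # draft agreed: accept for free
            ctx.append(tok)
        else:
            accepted.append(expected)  # take the target's correction, stop
            break
    else:
        accepted.append(target_next(ctx))  # bonus token when all k agree
    return accepted

# Stand-in models: the target repeats the last token; the draft agrees
# except at every third context length.
target = lambda ctx: ctx[-1]
draft = lambda ctx: ctx[-1] if len(ctx) % 3 else 0
assert speculative_step(draft, target, context=[7], k=4) == [7, 7, 7]
```

Note the invariant that makes this a pure latency optimization: with a greedy target, the accepted output is exactly what the target alone would have produced, only reached in fewer target invocations.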
Examines the architectural separation of prefill and decode phases across different hardware or instances to alleviate resource contention and improve overall system efficiency
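The disaggregation pattern can be sketched schematically: prefill workers build the KV cache for a prompt, then hand the request across a transfer boundary to a separate decode pool. The queue layout and names below are illustrative, and the "KV cache" is a placeholder list rather than real attention state.

```python
from collections import deque

class DisaggregatedServer:
    """Schematic prefill/decode disaggregation with two worker queues."""

    def __init__(self):
        self.prefill_queue = deque()  # compute-bound work (full prompts)
        self.decode_queue = deque()   # memory-bandwidth-bound work

    def submit(self, req_id: str, prompt: list) -> None:
        self.prefill_queue.append((req_id, prompt))

    def prefill_step(self) -> None:
        """Run on a prefill instance: process one prompt, ship KV to decoders."""
        if not self.prefill_queue:
            return
        req_id, prompt = self.prefill_queue.popleft()
        kv_cache = [tok * 2 for tok in prompt]        # stand-in for real KV state
        self.decode_queue.append((req_id, kv_cache))  # the KV transfer boundary

    def decode_step(self) -> list:
        """Run on a decode instance: advance every resident request one token."""
        return [req_id for req_id, _ in self.decode_queue]

server = DisaggregatedServer()
server.submit("a", [1, 2, 3])
server.prefill_step()
assert server.decode_step() == ["a"]
```

The point of the separation is visible even in this toy: prefill bursts never stall the decode loop, because the two queues can be drained by differently provisioned hardware.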
Reviews methods that forecast output sequence length to enable smarter scheduling, batching, and resource pre-allocation, reducing head-of-line blocking and wasted compute
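A minimal sketch of how a length prediction plugs into scheduling: admit requests shortest-predicted-first under a token budget, so short jobs are not stuck behind long ones. The predictor here is a crude stand-in (real systems train a classifier or regressor on prompts), and the budget and names are illustrative.

```python
import heapq

class LengthAwareScheduler:
    """Shortest-predicted-job-first batching under a token budget."""

    def __init__(self, predictor, token_budget: int):
        self.predictor = predictor        # prompt -> predicted output length
        self.token_budget = token_budget  # per-batch compute/KV budget
        self.queue = []                   # min-heap keyed by predicted length

    def submit(self, req_id: str, prompt: str) -> None:
        heapq.heappush(self.queue, (self.predictor(prompt), req_id))

    def next_batch(self) -> list:
        """Fill one batch shortest-first without exceeding the token budget."""
        batch, used = [], 0
        while self.queue and used + self.queue[0][0] <= self.token_budget:
            length, req_id = heapq.heappop(self.queue)
            used += length
            batch.append(req_id)
        return batch

# Stand-in predictor: guess output length from prompt word count.
sched = LengthAwareScheduler(predictor=lambda p: 4 * len(p.split()),
                             token_budget=40)
sched.submit("long", "please write a very detailed multi paragraph essay")
sched.submit("short", "hi there")
assert sched.next_batch() == ["short", "long"]
```

Mispredictions matter: underestimating a length overfills the budget at runtime, so practical schedulers pair the predictor with preemption or conservative padding.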
Results
| Coverage Dimension | Prior Surveys | This Survey | Delta |
|---|---|---|---|
| Instance-level scheduling & memory | Partial (model compression focus) | Comprehensive (5 sub-areas) | Broader scope |
| Cluster-level orchestration | Minimal | Full section (3 sub-areas) | New coverage |
| Emerging scenarios (multimodal, RAG, long-ctx) | Ad hoc or absent | Structured taxonomy | Systematic organization |
| Disaggregation & decode length prediction | Not covered | Dedicated treatment | Novel inclusion |
Key Takeaways
- Practitioners deploying LLMs at scale should consider prefill-decode disaggregation as an architectural pattern—separating compute-bound prefill from memory-bound decode phases can significantly improve hardware utilization and SLA compliance
- Request scheduling and decoding length prediction are underinvested levers in most production systems; accurately forecasting output lengths enables smarter batching strategies that reduce tail latency without sacrificing throughput
- Cluster-level load balancing and cloud-native autoscaling are as critical as model-level optimizations—even highly optimized single-instance serving can become a bottleneck if multi-instance orchestration is naive, making holistic system co-design essential
Abstract
Large Language Models (LLMs) for Generative AI have achieved remarkable progress, evolving into sophisticated and versatile tools widely adopted across various domains and applications. However, the substantial memory overhead caused by their vast number of parameters, combined with the high computational demands of the attention mechanism, poses significant challenges in achieving low latency and high throughput for LLM inference services. Recent research breakthroughs have greatly accelerated progress in this field. This paper provides a comprehensive survey of these methods, covering fundamental instance-level approaches, in-depth cluster-level strategies, emerging scenario directions, and other miscellaneous but important areas. At the instance level, we review model placement, request scheduling, decoding length prediction, storage management, and the disaggregation paradigm. At the cluster level, we explore GPU cluster deployment, multi-instance load balancing, and cloud service solutions. For emerging scenarios, we organize the discussion around specific tasks, modules, and auxiliary methods. To ensure a holistic overview, we also highlight several niche yet critical areas. Finally, we outline potential research directions to further advance the field of LLM inference serving.