Survey on Efficient Large Language Models: Principles, Algorithms, Applications, and Open Issues
Problem Statement
As LLMs grow in size and complexity, their computational and memory demands make deployment increasingly costly and impractical, especially at scale. Existing literature lacks a unified framework that connects inference optimization techniques across the full LLM lifecycle (training, fine-tuning, and serving). This survey addresses that gap by synthesizing recent advances into a structured, actionable taxonomy.
Key Novelty
- Proposes a new multi-dimensional taxonomy that categorizes LLM inference optimization into six major technique families: quantization, pruning, distillation, efficient architectures, compilation, and hardware-aware methods
- Frames the analysis along the full LLM development and deployment lifecycle, examining how optimization techniques interact across training, fine-tuning, and serving stages
- Highlights real-world applications of efficient LLMs and identifies emerging trends and unresolved open research challenges, providing forward-looking guidance for the field
Evaluation Highlights
- Qualitative synthesis across a broad corpus of recent LLM efficiency papers, covering multiple optimization dimensions rather than a single benchmark
- Comparative analysis of technique families in terms of computational cost reduction, model performance preservation, and deployment feasibility across different hardware targets
Breakthrough Assessment
Methodology
- Define foundational concepts of LLM inference optimization and establish a new taxonomy grouping techniques into quantization, pruning, distillation, efficient architectures, compilation, and hardware-aware methods
- Systematically review and analyze each technique category in the context of the LLM development lifecycle (pre-training, fine-tuning, and inference serving) to surface interactions and trade-offs
- Identify key real-world application domains for efficient LLMs, synthesize emerging trends, and catalog open research challenges to provide actionable guidance for future work
System Components
- Quantization: Reduces the numerical precision of model weights and activations (e.g., to INT8 or INT4) to decrease memory footprint and accelerate inference with minimal accuracy loss
- Pruning: Removes redundant or low-importance weights, attention heads, or layers to reduce model size and computation, either structurally or in an unstructured manner
- Knowledge Distillation: Transfers knowledge from a large teacher model to a smaller student model, preserving performance while significantly reducing model size
- Efficient Architectures: Designs or modifies model architectures (e.g., sparse attention, linear attention, mixture-of-experts) to reduce inherent computational complexity
- Compilation and Hardware-Aware Methods: Leverage compiler optimizations and hardware-specific kernels (e.g., FlashAttention, operator fusion, tiling) to maximize throughput on target devices
- Lifecycle Framework: An analytical framework mapping how each optimization technique applies and interacts across the training, fine-tuning, and serving stages of LLM deployment
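To make the first component concrete, the following is a minimal sketch of symmetric per-tensor INT8 post-training quantization, the simplest scheme the survey's quantization category covers. It is illustrative only: the function names are hypothetical, and production schemes are typically per-channel and calibration-based.

```python
import numpy as np

# Hypothetical sketch of symmetric per-tensor INT8 quantization.
def quantize_int8(w):
    """Map float32 weights to int8 with a single scale factor."""
    scale = float(np.abs(w).max()) / 127.0   # largest magnitude maps to +/-127
    q = np.round(w / scale).astype(np.int8)  # values lie in [-127, 127] by construction
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(w.nbytes // q.nbytes)  # 4 -> the 4x fp32-to-int8 memory reduction cited above
print(float(np.abs(w - w_hat).max()) <= 0.5 * scale + 1e-6)  # True: error bounded by half a quantization step
```

The half-step error bound is why accuracy loss stays small: each weight moves by at most `scale / 2`, and the scale shrinks as precision increases.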
Results
| Technique Category | Typical Compression/Speedup | Accuracy Impact | Deployment Suitability |
|---|---|---|---|
| Quantization (INT8/INT4) | 2–4x memory reduction | Minimal (<1% degradation typical) | High – widely deployed in production |
| Structured Pruning | Up to 50% parameter reduction | Moderate – task-dependent | Medium – requires fine-tuning to recover |
| Knowledge Distillation | 10–100x size reduction possible | Task-specific; well-studied trade-offs | High – enables edge/mobile deployment |
| Efficient Architectures | Sub-quadratic attention scaling | Competitive with full attention on many tasks | Medium-High – requires retraining |
| Compiler/Hardware Optimization | 2–3x throughput improvement | None – lossless optimization | High – complementary to all other methods |
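The pruning row above can be illustrated with a toy sketch of unstructured magnitude pruning at a fixed sparsity target. The function name and setup are assumptions for illustration; as the table notes, structured variants usually need fine-tuning afterwards to recover accuracy.

```python
import numpy as np

# Hypothetical sketch of unstructured magnitude pruning.
def magnitude_prune(w, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of weights."""
    k = int(w.size * sparsity)
    threshold = np.sort(np.abs(w), axis=None)[k - 1]  # k-th smallest magnitude
    return w * (np.abs(w) > threshold)                # mask keeps larger weights

rng = np.random.default_rng(1)
w = rng.normal(size=(128, 128)).astype(np.float32)
w_pruned = magnitude_prune(w, 0.5)
print(float(np.mean(w_pruned == 0)))  # ~0.5: half the parameters removed
```

Magnitude is only a proxy for importance, which is why the accuracy impact in the table is task-dependent; more sophisticated criteria score weights by their effect on the loss.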
Key Takeaways
- Combining multiple orthogonal techniques (e.g., quantization + pruning + compilation) yields multiplicative efficiency gains and is the recommended strategy for production LLM deployment
- Optimization techniques must be selected with the deployment lifecycle stage in mind: some (e.g., distillation, architecture changes) require retraining, while others (e.g., post-training quantization, compiler optimizations) can be applied to an already-trained model without any retraining
- Hardware-aware methods and compiler-level optimizations (like FlashAttention and operator fusion) are often overlooked but provide significant lossless speedups that are complementary to model compression and should be part of every deployment pipeline
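A back-of-the-envelope estimate shows why combined techniques compose multiplicatively, as the first takeaway recommends: pruning shrinks the number of stored weights while quantization shrinks the bytes per weight. The COO-style storage format (1-byte value plus 4-byte flat index per survivor) and the numbers below are assumptions for illustration, not measurements from the survey.

```python
import numpy as np

# Hypothetical estimate of stacking pruning (90% sparsity) with INT8 storage.
def prune(w, sparsity):
    k = int(w.size * sparsity)
    thresh = np.sort(np.abs(w), axis=None)[k - 1]
    return w * (np.abs(w) > thresh)

def sparse_int8_bytes(w):
    """Bytes to store nonzeros as int8 values plus 4-byte flat indices (COO-style)."""
    nnz = int(np.count_nonzero(w))
    return nnz * (1 + 4)

rng = np.random.default_rng(2)
w = rng.normal(size=(512, 512)).astype(np.float32)
ratio = w.nbytes / sparse_int8_bytes(prune(w, 0.9))
print(round(ratio, 1))  # ~8x smaller than the dense fp32 baseline
```

Note the index overhead eats into the gain: 90% sparsity alone would suggest 10x, and INT8 alone 4x, but the 4-byte index per survivor caps this toy format at about 8x. Any lossless compiler-level speedup then multiplies on top.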
Abstract
With the rapid advancement of large language models (LLMs) in both academia and industry, their growing size and complexity have introduced significant challenges in terms of computational cost and deployment efficiency. To address these issues, a wide range of inference optimization techniques, including but not limited to model compression, have been proposed to accelerate LLM inference while preserving model performance. This survey provides a comprehensive overview of LLM inference acceleration strategies, analyzing them from multiple perspectives, including foundational principles, algorithmic techniques, real-world applications, and open research challenges. We begin by introducing core concepts underlying inference optimization and propose a new taxonomy that categorizes existing approaches into quantization, pruning, distillation, efficient architectures, compilation, and hardware-aware methods. Following the lifecycle of LLM development and deployment, we examine how these techniques interact with model training, fine-tuning, and serving. Furthermore, we highlight key applications of efficient LLMs and discuss emerging trends and unresolved issues in the field. By synthesizing recent advances, this survey aims to provide actionable insights and practical guidance for researchers and practitioners building scalable and efficient LLM systems.