Survey on Efficient Large Language Models: Principles, Algorithms, Applications, and Open Issues

Jian Cheng, Haidong Kang, Yuxin Shao, Nan Li, Pengjun Chen, Rui Wang, Saiqin Long, Xiaochun Yang, Lianbo Ma
IEEE Transactions on Neural Networks and Learning Systems | 2025
This survey provides a comprehensive taxonomy and analysis of LLM inference acceleration techniques—spanning quantization, pruning, distillation, efficient architectures, compilation, and hardware-aware methods—to guide researchers and practitioners in deploying scalable, efficient LLM systems.

Problem Statement

As LLMs grow in size and complexity, their computational and memory demands make deployment increasingly costly and impractical, especially at scale. Existing literature lacks a unified framework that connects inference optimization techniques across the full LLM lifecycle (training, fine-tuning, and serving). This survey addresses that gap by synthesizing recent advances into a structured, actionable taxonomy.

Key Novelty

  • Proposes a new multi-dimensional taxonomy that categorizes LLM inference optimization into six major technique families: quantization, pruning, distillation, efficient architectures, compilation, and hardware-aware methods
  • Frames the analysis along the full LLM development and deployment lifecycle, examining how optimization techniques interact across training, fine-tuning, and serving stages
  • Highlights real-world applications of efficient LLMs and identifies emerging trends and unresolved open research challenges, providing forward-looking guidance for the field

Evaluation Highlights

  • Qualitative synthesis across a broad corpus of recent LLM efficiency papers, covering multiple optimization dimensions rather than a single benchmark
  • Comparative analysis of technique families in terms of computational cost reduction, model performance preservation, and deployment feasibility across different hardware targets

Breakthrough Assessment

4/10 This is a well-structured and timely survey published in a top IEEE venue, offering a useful taxonomy and lifecycle framing, but as a survey paper it is inherently derivative of existing work rather than introducing new algorithms or empirical breakthroughs.

Methodology

  1. Define foundational concepts of LLM inference optimization and establish a new taxonomy grouping techniques into quantization, pruning, distillation, efficient architectures, compilation, and hardware-aware methods
  2. Systematically review and analyze each technique category in the context of the LLM development lifecycle—covering pre-training, fine-tuning, and inference serving stages—to surface interactions and trade-offs
  3. Identify key real-world application domains for efficient LLMs, synthesize emerging trends, and catalog open research challenges to provide actionable guidance for future work

System Components

Quantization

Reduces numerical precision of model weights and activations (e.g., INT8, INT4) to decrease memory footprint and accelerate inference with minimal accuracy loss
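As a concrete illustration of the idea, the sketch below implements symmetric per-tensor INT8 quantization on a flat list of floats. The function names and the toy weight values are illustrative, not from any specific library or from the survey itself:

```python
# Minimal sketch of symmetric post-training INT8 quantization for one
# weight tensor, represented as a flat list of floats.

def quantize_int8(weights):
    """Map floats to int8 range [-127, 127] with one per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats; error per element is at most scale/2."""
    return [qi * scale for qi in q]

w = [0.5, -1.27, 0.01, 1.0]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
```

Storing 8-bit integers plus one scale per tensor is where the roughly 4x memory reduction over FP32 comes from; INT4 variants push this further at the cost of larger rounding error.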

Pruning

Removes redundant or low-importance weights, attention heads, or layers, in either a structured or unstructured manner, to reduce model size and computation
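The unstructured variant can be sketched as simple magnitude pruning, shown below on a toy weight list. This is an illustrative re-implementation, not the survey's own code; a real pipeline would operate on tensors and typically fine-tune afterward to recover accuracy:

```python
# Sketch of unstructured magnitude pruning: zero out the fraction of
# weights with the smallest absolute values.

def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-|w| fraction zeroed."""
    k = int(len(weights) * sparsity)  # number of weights to remove
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

pruned = magnitude_prune([0.9, -0.05, 0.4, 0.01, -0.7, 0.2], sparsity=0.5)
# The three smallest-magnitude weights are now exact zeros.
```

Structured pruning applies the same scoring idea at a coarser granularity (whole heads, channels, or layers), which sacrifices some flexibility but yields dense sub-models that run fast on ordinary hardware.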

Knowledge Distillation

Transfers knowledge from a large teacher model to a smaller student model, preserving performance while significantly reducing model size
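The soft-label part of this transfer can be sketched as matching temperature-softened output distributions, as in Hinton et al.'s classic formulation. The logits and temperature below are toy values for illustration only:

```python
import math

# Sketch of the soft-label distillation loss: the student is trained to
# match the teacher's temperature-softened output distribution.

def softmax(logits, T=1.0):
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) on softened distributions, scaled by T*T."""
    p = softmax(teacher_logits, T)  # soft teacher targets
    q = softmax(student_logits, T)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return T * T * kl

loss = distillation_loss([3.0, 1.0, 0.2], [2.5, 1.2, 0.1])
# Zero when the softened distributions match exactly.
```

In practice this term is combined with the ordinary cross-entropy on hard labels, and the temperature controls how much of the teacher's "dark knowledge" about incorrect classes the student sees.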

Efficient Architectures

Designs or modifies model architectures (e.g., sparse attention, linear attention, mixture-of-experts) to reduce inherent computational complexity
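One representative instance of sparse attention is sliding-window (local) attention, sketched below over scalar queries and keys. Shapes and names are heavily simplified for illustration; the point is the cost structure, not a usable layer:

```python
import math

# Toy sketch of sliding-window (local) attention: each position attends
# only to keys within `window` steps, so cost grows as O(n * window)
# rather than the O(n^2) of full attention.

def local_attention(q, k, v, window=2):
    out = []
    for i, qi in enumerate(q):
        lo, hi = max(0, i - window), min(len(k), i + window + 1)
        scores = [qi * k[j] for j in range(lo, hi)]
        m = max(scores)  # subtract max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append(sum(w * v[lo + j] for j, w in enumerate(weights)) / total)
    return out

y = local_attention(q=[1.0, 0.5, -0.2, 0.8], k=[0.3, 1.0, -0.5, 0.2],
                    v=[1.0, 2.0, 3.0, 4.0], window=1)
```

Each output is a convex combination of the values inside its window; linear attention and mixture-of-experts attack the same complexity problem by different routes (kernelized attention, conditional computation).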

Compilation & Hardware-Aware Methods

Leverages compiler optimizations and hardware-specific kernels (e.g., FlashAttention, operator fusion, tiling) to maximize throughput on target devices
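The core idea behind operator fusion can be shown with a deliberately tiny stand-in (scale-then-add-bias instead of real GPU kernels): the unfused version materializes an intermediate buffer and traverses memory twice, while the fused version does both operations in a single pass. Function names here are illustrative:

```python
# Sketch of operator fusion. Unfused: two traversals plus an
# intermediate buffer. Fused: one traversal, no intermediate.

def scale_then_bias_unfused(x, scale, bias):
    tmp = [xi * scale for xi in x]       # intermediate buffer, pass 1
    return [ti + bias for ti in tmp]     # second full traversal, pass 2

def scale_then_bias_fused(x, scale, bias):
    return [xi * scale + bias for xi in x]  # single pass

x = [1.0, 2.0, 3.0]
assert scale_then_bias_unfused(x, 2.0, 1.0) == scale_then_bias_fused(x, 2.0, 1.0)
```

The results are identical, which is why the survey classifies these optimizations as lossless: kernels like FlashAttention apply the same principle (plus tiling to fit data in fast on-chip memory) to the attention computation itself.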

Lifecycle Integration Framework

Analytical framework mapping how each optimization technique applies and interacts across training, fine-tuning, and serving stages of LLM deployment

Results

Technique Category | Typical Compression/Speedup | Accuracy Impact | Deployment Suitability
Quantization (INT8/INT4) | 2–4x memory reduction | Minimal (<1% degradation typical) | High – widely deployed in production
Structured Pruning | Up to 50% parameter reduction | Moderate – task-dependent | Medium – requires fine-tuning to recover
Knowledge Distillation | 10–100x size reduction possible | Task-specific; well-studied trade-offs | High – enables edge/mobile deployment
Efficient Architectures | Sub-quadratic attention scaling | Competitive with full attention on many tasks | Medium-High – requires retraining
Compiler/Hardware Optimization | 2–3x throughput improvement | None – lossless optimization | High – complementary to all other methods

Key Takeaways

  • Combining multiple orthogonal techniques (e.g., quantization + pruning + compilation) yields multiplicative efficiency gains and is the recommended strategy for production LLM deployment
  • Optimization techniques must be selected with the deployment lifecycle stage in mind—some (e.g., distillation, architecture changes) require retraining, while others (e.g., post-training quantization, compiler optimizations) can be applied without modifying the model
  • Hardware-aware methods and compiler-level optimizations (like FlashAttention and operator fusion) are often overlooked but provide significant lossless speedups that are complementary to model compression and should be part of every deployment pipeline
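The first takeaway, that orthogonal techniques compose, can be sketched by chaining two toy helpers from the technique descriptions above: magnitude pruning followed by symmetric INT8 quantization. These are illustrative re-implementations, not a real deployment pipeline:

```python
# Sketch of composing two orthogonal techniques: prune first, then
# apply post-training quantization to the pruned weights.

def prune(weights, sparsity):
    """Zero out the smallest-|w| fraction of weights."""
    k = int(len(weights) * sparsity)
    thresh = sorted(abs(w) for w in weights)[k - 1] if k else -1.0
    return [0.0 if abs(w) <= thresh else w for w in weights]

def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

w = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2]
q, scale = quantize_int8(prune(w, sparsity=0.5))
# Pruned weights quantize to exact zeros, so the sparsity survives
# quantization and the two savings compose.
```

Adding compiler-level optimization on top is free in accuracy terms, which is why the survey recommends stacking all three families in production.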

Abstract

With the rapid advancement of large language models (LLMs) in both academia and industry, their growing size and complexity have introduced significant challenges in terms of computational cost and deployment efficiency. To address these issues, a wide range of inference optimization techniques, including but not limited to model compression, have been proposed to accelerate LLM inference while preserving model performance. This survey provides a comprehensive overview of LLM inference acceleration strategies, analyzing them from multiple perspectives, including foundational principles, algorithmic techniques, real-world applications, and open research challenges. We begin by introducing core concepts underlying inference optimization and propose a new taxonomy that categorizes existing approaches, including quantization, pruning, distillation, efficient architectures, compilation, and hardware-aware methods. Following the lifecycle of LLM development and deployment, we examine how these techniques interact with model training, fine-tuning, and serving. Furthermore, we highlight key applications of efficient LLMs and discuss emerging trends and unresolved issues in the field. By synthesizing recent advances, this survey aims to provide actionable insights and practical guidance for researchers and practitioners working with scalable and efficient LLM systems.
