UniComp: A Unified Evaluation of Large Language Model Compression via Pruning, Quantization and Distillation
Problem Statement
Deploying LLMs at scale requires compression, but existing evaluations are fragmented, cover limited methods, and over-rely on knowledge-centric benchmarks that mask degradation in reasoning, safety, and multilingual capabilities. There is no comprehensive, hardware-aware framework that holistically compares pruning, quantization, and distillation under consistent conditions. This gap leads to poorly informed compression choices and overlooked failure modes in production systems.
Key Novelty
- Unified multi-dimensional evaluation framework covering performance, reliability (safety), and hardware-aware efficiency across all three major compression paradigms simultaneously
- Discovery and characterization of a 'knowledge bias' in LLM compression, where knowledge-intensive tasks are disproportionately preserved while reasoning, multilingual, and instruction-following capabilities degrade substantially
- Empirical finding that task-specific calibration data can recover up to 50% of reasoning ability lost in pruned models, providing actionable guidance for practitioners
Evaluation Highlights
- Extensive evaluation of six compression techniques on modern LLMs across more than 40 datasets spanning capability- and safety-oriented benchmarks, providing the broadest compression comparison to date
- Task-specific calibration improves reasoning ability of pruned models by up to 50%; quantization offers the best overall performance-efficiency trade-off; distillation achieves strong runtime acceleration but at high computational training cost
Methodology
- Select six representative compression techniques spanning pruning, quantization, and knowledge distillation, and apply them to modern LLMs under controlled conditions to ensure fair comparison
- Evaluate compressed models across 40+ datasets organized into three dimensions: performance (capability benchmarks including reasoning, multilingual, instruction-following), reliability (safety-oriented benchmarks), and efficiency (hardware-aware latency/throughput analysis)
- Analyze results to identify cross-cutting patterns such as the knowledge bias phenomenon, method-level trade-offs, and the impact of calibration dataset choice on downstream task recovery
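The three-dimension evaluation loop described above can be sketched as follows. This is a minimal illustration with stub scorers standing in for real benchmark harnesses; the function names, benchmark labels, and numeric scores are all placeholders, not the paper's actual implementation or results.

```python
# Sketch of a UniComp-style three-dimension evaluation loop.
# Benchmark names and scores are illustrative stubs, not real results.

DIMENSIONS = {
    "performance": ["reasoning", "multilingual", "instruction_following"],
    "reliability": ["safety"],
}

def run_benchmark(model, bench):
    """Stub scorer: a real harness would evaluate `model` on the named benchmark."""
    return {"reasoning": 0.41, "multilingual": 0.37,
            "instruction_following": 0.55, "safety": 0.72}[bench]

def evaluate_compressed(model):
    # Capability and safety scores, grouped by dimension
    results = {dim: {b: run_benchmark(model, b) for b in benches}
               for dim, benches in DIMENSIONS.items()}
    # Efficiency is measured separately on real hardware (stubbed here)
    results["efficiency"] = {"latency_ms": 12.3, "throughput_tok_s": 880.0}
    return results

report = evaluate_compressed("pruned-llm")
```

Grouping results by dimension rather than by dataset makes the knowledge-bias pattern visible directly: a model can look healthy on a knowledge-centric aggregate while its reasoning and multilingual sub-scores collapse.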
System Components
- Compression suite: six techniques covering the three major paradigms — structured/unstructured pruning, post-training quantization, and knowledge distillation — applied to modern LLMs
- Capability benchmarks: 40+ datasets assessing diverse capabilities including reasoning, multilingual understanding, and instruction following
- Safety benchmarks: safety-oriented evaluations of how compression affects model behavior on sensitive or high-stakes outputs
- Efficiency profiler: measures real-world latency, throughput, and memory footprint on hardware to quantify each method's practical efficiency gains
- Calibration analysis: examines how task-specific calibration data during compression (especially pruning) can recover degraded capabilities, particularly reasoning, by up to 50%
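The calibration analysis above can be illustrated with a minimal sketch of activation-aware unstructured pruning, in the spirit of Wanda-style scoring (importance = |weight| x per-input activation norm). The calibration matrices here are synthetic stand-ins; the point is only that swapping generic calibration data for task-specific data changes which weights survive.

```python
# Sketch of calibration-aware unstructured pruning (Wanda-style scoring).
# All data below is synthetic; this is an illustration, not the paper's method.
import numpy as np

def prune_with_calibration(W, X, sparsity=0.5):
    """Zero the least important fraction of weights in W (out x in).

    Importance combines weight magnitude with activation statistics
    gathered from a calibration batch X (samples x in).
    """
    act_norm = np.linalg.norm(X, axis=0)      # per-input-feature L2 norm
    importance = np.abs(W) * act_norm         # broadcast over output rows
    k = int(W.size * sparsity)                # number of weights to drop
    thresh = np.partition(importance.ravel(), k - 1)[k - 1]
    return W * (importance > thresh)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
X_generic = rng.normal(size=(32, 16))                            # generic text stats
X_task = rng.normal(size=(32, 16)) * np.linspace(0.1, 2.0, 16)   # task-skewed stats
W_generic = prune_with_calibration(W, X_generic)
W_task = prune_with_calibration(W, X_task)
```

Because the activation norms enter the importance score, a task-specific calibration batch steers the pruning mask toward preserving the weights that matter for that task — the mechanism behind the reported reasoning recovery.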
Results
| Compression Method | Performance Retention | Efficiency Gain | Notable Trade-off |
|---|---|---|---|
| Quantization | Best overall retention | Strong hardware efficiency | Best overall performance-efficiency trade-off |
| Distillation | Competitive retention | Highest runtime acceleration | High computational training cost |
| Pruning (default calib.) | Significant reasoning drop | Moderate efficiency gain | Knowledge bias: reasoning degrades substantially |
| Pruning (task-specific calib.) | Up to 50% reasoning recovery | Moderate efficiency gain | Calibration data choice is critical |
| All methods (knowledge tasks) | Relatively preserved | Varies by method | Knowledge bias consistently observed across methods |
| All methods (multilingual/IF) | Substantial degradation | Varies by method | Underreported in prior narrow benchmarks |
Key Takeaways
- Quantization is the safest default choice for LLM compression when balancing retained model quality and inference efficiency, making it the recommended first option for most deployment scenarios
- Standard knowledge-centric benchmarks (e.g., perplexity, MMLU) are insufficient for evaluating compressed models — practitioners must include reasoning, multilingual, instruction-following, and safety benchmarks to detect real capability degradation
- When using pruning, carefully selecting task-specific calibration data can recover up to 50% of lost reasoning performance, meaning calibration strategy is as important as the compression algorithm itself
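For readers unfamiliar with why quantization is such a cheap default, here is a minimal sketch of symmetric per-tensor int8 post-training quantization — the simplest variant of the family of schemes the recommendation above refers to (the scale choice and rounding below are deliberately naive, not any specific production method).

```python
# Minimal sketch of symmetric per-tensor int8 post-training quantization.
# Deliberately naive (single scale, round-to-nearest); for illustration only.

def quantize_int8(weights):
    """Map floats to int8 codes with one symmetric scale; return (codes, scale)."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # guard all-zero tensor
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float weights from int8 codes."""
    return [c * scale for c in codes]

w = [0.02, -1.27, 0.635, 0.0]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
```

Each weight is reconstructed to within half a quantization step (scale/2), with no retraining required — which is why post-training quantization retains quality so well relative to its 4x memory saving over float32.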
Abstract
Model compression is increasingly essential for deploying large language models (LLMs), yet existing evaluations are limited in method coverage and focus primarily on knowledge-centric benchmarks. To address this gap, we introduce UniComp, a unified evaluation framework for comparing pruning, quantization, and knowledge distillation. UniComp evaluates compressed models along three dimensions — performance, reliability, and efficiency — using a diverse set of capability- and safety-oriented benchmarks together with a hardware-aware efficiency analysis. Through extensive evaluation of six compression techniques on modern LLMs across more than 40 datasets, we find that (i) compression exhibits a consistent knowledge bias, where knowledge-intensive tasks are relatively preserved while reasoning, multilingual, and instruction-following capabilities degrade substantially; (ii) quantization provides the best overall trade-off between retained performance and efficiency, whereas distillation yields strong runtime acceleration at high computational cost; and (iii) task-specific calibration can improve the reasoning ability of pruned models by up to 50%.