UniComp: A Unified Evaluation of Large Language Model Compression via Pruning, Quantization and Distillation
Problem Statement
Deploying LLMs at scale requires compression, but existing evaluations are fragmented, cover limited methods, and over-rely on knowledge-centric benchmarks that mask degradation in reasoning, safety, and multilingual capabilities. There is no comprehensive, hardware-aware framework that holistically compares pruning, quantization, and distillation under consistent conditions. This gap leads to poorly informed compression choices and overlooked failure modes in production systems.
Key Novelty
- Unified multi-dimensional evaluation framework covering performance, reliability (safety), and hardware-aware efficiency across all three major compression paradigms simultaneously
- Discovery and characterization of a 'knowledge bias' in LLM compression, where knowledge-intensive tasks are disproportionately preserved while reasoning, multilingual, and instruction-following capabilities degrade substantially
- Empirical finding that task-specific calibration data can recover up to 50% of reasoning ability lost in pruned models, providing actionable guidance for practitioners
Evaluation Highlights
- Extensive evaluation of six compression techniques on modern LLMs across more than 40 datasets spanning capability- and safety-oriented benchmarks, providing the broadest compression comparison to date
- Task-specific calibration improves reasoning ability of pruned models by up to 50%; quantization offers the best overall performance-efficiency trade-off; distillation achieves strong runtime acceleration but at high computational training cost
Methodology
- Select six representative compression techniques spanning pruning, quantization, and knowledge distillation, and apply them to modern LLMs under controlled conditions to ensure fair comparison
- Evaluate compressed models across 40+ datasets organized into three dimensions: performance (capability benchmarks including reasoning, multilingual, instruction-following), reliability (safety-oriented benchmarks), and efficiency (hardware-aware latency/throughput analysis)
- Analyze results to identify cross-cutting patterns such as the knowledge bias phenomenon, method-level trade-offs, and the impact of calibration dataset choice on downstream task recovery
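The three-dimension evaluation loop described above can be sketched as follows. This is a minimal illustration with stub scorers standing in for real benchmark harnesses; the function names, benchmark labels, and numeric scores are all placeholders, not the paper's actual implementation or results.

```python
# Sketch of a UniComp-style three-dimension evaluation loop.
# Benchmark names and scores are illustrative stubs, not real results.

DIMENSIONS = {
    "performance": ["reasoning", "multilingual", "instruction_following"],
    "reliability": ["safety"],
}

def run_benchmark(model, bench):
    """Stub scorer: a real harness would evaluate `model` on the named benchmark."""
    return {"reasoning": 0.41, "multilingual": 0.37,
            "instruction_following": 0.55, "safety": 0.72}[bench]

def evaluate_compressed(model):
    # Capability and safety scores, grouped by dimension
    results = {dim: {b: run_benchmark(model, b) for b in benches}
               for dim, benches in DIMENSIONS.items()}
    # Efficiency is measured separately on real hardware (stubbed here)
    results["efficiency"] = {"latency_ms": 12.3, "throughput_tok_s": 880.0}
    return results

report = evaluate_compressed("pruned-llm")
```

Grouping results by dimension rather than by dataset makes the knowledge-bias pattern visible directly: a model can look healthy on a knowledge-centric aggregate while its reasoning and multilingual sub-scores collapse.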
System Components
- Compression suite: six techniques covering the three major paradigms — structured/unstructured pruning, post-training quantization, and knowledge distillation — applied to modern LLMs
- Capability benchmarks: 40+ datasets assessing diverse capabilities including reasoning, multilingual understanding, and instruction following
- Safety benchmarks: safety-oriented evaluations of how compression affects model behavior on sensitive or high-stakes outputs
- Efficiency profiler: measures real-world latency, throughput, and memory footprint on hardware to quantify each method's practical efficiency gains
- Calibration analysis: examines how task-specific calibration data during compression (especially pruning) can recover degraded capabilities, particularly reasoning, by up to 50%
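The calibration analysis above can be illustrated with a minimal sketch of activation-aware unstructured pruning, in the spirit of Wanda-style scoring (importance = |weight| x per-input activation norm). The calibration matrices here are synthetic stand-ins; the point is only that swapping generic calibration data for task-specific data changes which weights survive.

```python
# Sketch of calibration-aware unstructured pruning (Wanda-style scoring).
# All data below is synthetic; this is an illustration, not the paper's method.
import numpy as np

def prune_with_calibration(W, X, sparsity=0.5):
    """Zero the least important fraction of weights in W (out x in).

    Importance combines weight magnitude with activation statistics
    gathered from a calibration batch X (samples x in).
    """
    act_norm = np.linalg.norm(X, axis=0)      # per-input-feature L2 norm
    importance = np.abs(W) * act_norm         # broadcast over output rows
    k = int(W.size * sparsity)                # number of weights to drop
    thresh = np.partition(importance.ravel(), k - 1)[k - 1]
    return W * (importance > thresh)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
X_generic = rng.normal(size=(32, 16))                            # generic text stats
X_task = rng.normal(size=(32, 16)) * np.linspace(0.1, 2.0, 16)   # task-skewed stats
W_generic = prune_with_calibration(W, X_generic)
W_task = prune_with_calibration(W, X_task)
```

Because the activation norms enter the importance score, a task-specific calibration batch steers the pruning mask toward preserving the weights that matter for that task — the mechanism behind the reported reasoning recovery.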
Results
| Compression Method | Performance Retention | Efficiency Gain | Notable Trade-off |
|---|---|---|---|
| Quantization | Best overall retention | Strong hardware efficiency | Best overall performance-efficiency trade-off |
| Distillation | Competitive retention | Highest runtime acceleration | High computational training cost |
| Pruning (default calib.) | Significant reasoning drop | Moderate efficiency gain | Knowledge bias: reasoning degrades substantially |
| Pruning (task-specific calib.) | Up to 50% reasoning recovery | Moderate efficiency gain | Calibration data choice is critical |
| All methods (knowledge tasks) | Relatively preserved | Varies by method | Knowledge bias consistently observed across methods |
| All methods (multilingual/IF) | Substantial degradation | Varies by method | Underreported in prior narrow benchmarks |
Key Takeaways
- Quantization is the safest default choice for LLM compression when balancing retained model quality and inference efficiency, making it the recommended first option for most deployment scenarios
- Standard knowledge-centric benchmarks (e.g., perplexity, MMLU) are insufficient for evaluating compressed models — practitioners must include reasoning, multilingual, instruction-following, and safety benchmarks to detect real capability degradation
- When using pruning, carefully selecting task-specific calibration data can recover up to 50% of lost reasoning performance, meaning calibration strategy is as important as the compression algorithm itself
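For readers unfamiliar with why quantization is such a cheap default, here is a minimal sketch of symmetric per-tensor int8 post-training quantization — the simplest variant of the family of schemes the recommendation above refers to (the scale choice and rounding below are deliberately naive, not any specific production method).

```python
# Minimal sketch of symmetric per-tensor int8 post-training quantization.
# Deliberately naive (single scale, round-to-nearest); for illustration only.

def quantize_int8(weights):
    """Map floats to int8 codes with one symmetric scale; return (codes, scale)."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # guard all-zero tensor
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float weights from int8 codes."""
    return [c * scale for c in codes]

w = [0.02, -1.27, 0.635, 0.0]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
```

Each weight is reconstructed to within half a quantization step (scale/2), with no retraining required — which is why post-training quantization retains quality so well relative to its 4x memory saving over float32.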
Abstract
Model compression is increasingly essential for deploying large language models (LLMs), yet existing evaluations are limited in method coverage and focus primarily on knowledge-centric benchmarks. To address this gap, we introduce UniComp, a unified evaluation framework for comparing pruning, quantization, and knowledge distillation. UniComp evaluates compressed models along three dimensions — performance, reliability, and efficiency — using a diverse set of capability- and safety-oriented benchmarks together with a hardware-aware efficiency analysis. Through extensive evaluation of six compression techniques on modern LLMs across more than 40 datasets, we find that (i) compression exhibits a consistent knowledge bias, where knowledge-intensive tasks are relatively preserved while reasoning, multilingual, and instruction-following capabilities degrade substantially; (ii) quantization provides the best overall trade-off between retained performance and efficiency, whereas distillation yields strong runtime acceleration at high computational cost; and (iii) task-specific calibration can improve the reasoning ability of pruned models by up to 50%.