Towards Efficient Large Language Reasoning Models via Extreme-Ratio Chain-of-Thought Compression
Problem Statement
Chain-of-Thought reasoning dramatically improves LLM performance but generates excessively long token sequences that impose high inference costs, limiting deployment efficiency. Existing CoT compression methods degrade in logical fidelity at high compression ratios, causing significant accuracy drops that make them impractical for aggressive compression scenarios. There is a need for methods that can maintain reasoning quality even under extreme token budget constraints.
Key Novelty
- A semantically-preserved compressor trained on fine-grained annotated mathematical CoT data that generates high-fidelity compressed supervision pairs for downstream fine-tuning
- Mixed-ratio Supervised Fine-Tuning (SFT) that teaches the LLM to operate across a spectrum of compression budgets, providing a stable initialization point before reinforcement learning
- Constrained and Hierarchical Ratio Policy Optimization (CHRPO), a novel RL method that uses a hierarchical reward structure to explicitly incentivize correct question-solving under increasingly aggressive token budget constraints
Evaluation Highlights
- On MATH-500 with Qwen3-1.7B, Extra-CoT achieves over 73% token reduction with a +0.6% accuracy improvement over baseline, outperforming SOTA compression methods
- Results demonstrated across three mathematical reasoning benchmarks, consistently showing superiority over existing CoT compression approaches at extreme compression ratios
Methodology
- Step 1 — Train a dedicated semantically-preserved CoT compressor on mathematical data with fine-grained annotations to generate high-fidelity compressed CoT pairs at various compression ratios
- Step 2 — Fine-tune the target LLM on these compressed pairs using mixed-ratio SFT, exposing it to a range of token budgets to build multi-ratio compression capability and a stable RL initialization
- Step 3 — Apply Constrained and Hierarchical Ratio Policy Optimization (CHRPO) via reinforcement learning, using a hierarchical reward that explicitly incentivizes correct answers under lower token budgets to push extreme-ratio compression without accuracy collapse
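The hierarchical reward in Step 3 could be sketched as follows. The tier budgets and bonus values below are illustrative assumptions, not taken from the paper; the only property carried over from the description is that a correct answer earns strictly more reward the tighter the token budget it satisfies.

```python
def hierarchical_reward(is_correct, tokens_used, budgets=(1024, 512, 256, 128)):
    """Toy hierarchical reward: a correct answer earns a bonus for every
    budget tier it fits under, so tighter traces score higher; incorrect
    answers earn nothing. Tier sizes and bonus weights are assumptions."""
    if not is_correct:
        return 0.0
    reward = 1.0  # base reward for a correct answer at any length
    # Walk tiers from loosest to tightest; each satisfied tier raises the reward.
    for tier, budget in enumerate(sorted(budgets, reverse=True), start=1):
        if tokens_used <= budget:
            reward = 1.0 + 0.5 * tier
    return reward
```

Under this shape, a correct 100-token trace (within all four tiers) scores 3.0, a correct 600-token trace (within only the loosest tier) scores 1.5, and any incorrect trace scores 0.0, which gives the policy a gradient toward aggressive compression without rewarding wrong answers.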
System Components
- Semantically-preserved CoT compressor — a dedicated model trained with fine-grained CoT annotations to compress reasoning chains while retaining logical fidelity, generating reliable supervision data for the main LLM
- Mixed-ratio SFT stage — a supervised fine-tuning stage that trains the LLM on compressed CoT pairs across multiple compression ratios simultaneously, teaching budget-aware reasoning and stabilizing RL training
- CHRPO — a reinforcement learning algorithm that uses a hierarchical reward signal to progressively incentivize accurate reasoning under lower token budgets, explicitly optimizing the trade-off between compression and correctness
- Extra-CoT pipeline — the end-to-end integration of the compressor, mixed-ratio SFT, and CHRPO to achieve extreme-ratio CoT compression with minimal accuracy loss
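The mixed-ratio SFT stage could assemble its training set roughly as below. This is a minimal sketch under stated assumptions: the `<ratio=...>` prompt tag, the per-ratio `compressed_cot` field, and the record layout are all hypothetical interfaces, since the paper summary does not specify how the budget is communicated to the model.

```python
import random

def build_mixed_ratio_sft(examples, ratios=(0.25, 0.5, 0.75), seed=0):
    """Build a mixed-ratio SFT dataset: each sample pairs a question,
    a compression-ratio instruction, and the compressed CoT produced by
    the (hypothetical) compressor at that ratio, followed by the answer.
    The tag format and field names are illustrative assumptions."""
    rng = random.Random(seed)
    dataset = []
    for ex in examples:
        r = rng.choice(ratios)  # mix ratios across the dataset
        prompt = f"<ratio={r}> {ex['question']}"
        target = ex["compressed_cot"][r] + " " + ex["answer"]
        dataset.append({"prompt": prompt, "target": target})
    return dataset
```

Sampling a ratio per example (rather than fixing one) is what exposes the model to a spectrum of budgets in a single SFT run, which is the stabilizing property the summary attributes to this stage.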
Results
| Benchmark / Model | SOTA compression baseline | Extra-CoT | Delta |
|---|---|---|---|
| MATH-500 / Qwen3-1.7B accuracy | baseline accuracy (lower) | +0.6% over baseline | +0.6% accuracy |
| MATH-500 / Qwen3-1.7B token usage | baseline token count | ~27% of baseline | >73% token reduction |
| Three mathematical reasoning benchmarks | SOTA compression methods | superior on all three | consistent improvement |
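The token-usage row reports the compressed trace at roughly 27% of the baseline length, i.e. a reduction of just over 73%. A small helper makes the accounting explicit:

```python
def token_reduction(baseline_tokens, compressed_tokens):
    """Percentage of tokens saved relative to the baseline reasoning trace."""
    return 100.0 * (1 - compressed_tokens / baseline_tokens)

# For instance, a 1000-token baseline trace compressed to 270 tokens
# is a 73% reduction, consistent with the MATH-500 figure above.
```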
Key Takeaways
- Practitioners can deploy reasoning LLMs at dramatically lower inference costs (>73% fewer tokens) without sacrificing accuracy by using a pipeline that combines a dedicated compressor, multi-ratio SFT, and constrained RL — offering a practical path to efficient on-device or API-cost-sensitive deployments
- The two-stage training strategy (SFT warm-up followed by RL fine-tuning) is key to stability at extreme compression ratios; skipping the mixed-ratio SFT initialization risks RL instability and accuracy collapse
- Fine-grained supervision from a dedicated semantic compressor is critical — using off-the-shelf summarization or naive truncation as supervision degrades logical fidelity, highlighting the importance of quality compressed training data generation
Abstract
Chain-of-Thought (CoT) reasoning successfully enhances the reasoning capabilities of Large Language Models (LLMs), yet it incurs substantial computational overhead at inference. Existing CoT compression methods often suffer a critical loss of logical fidelity at high compression ratios, resulting in significant performance degradation. To achieve high-fidelity, fast reasoning, we propose a novel EXTreme-RAtio Chain-of-Thought Compression framework, termed Extra-CoT, which aggressively reduces the token budget while preserving answer accuracy. To generate reliable, high-fidelity supervision, we first train a dedicated semantically-preserved compressor on mathematical CoT data with fine-grained annotations. An LLM is then fine-tuned on these compressed pairs via mixed-ratio supervised fine-tuning (SFT), teaching it to follow a spectrum of compression budgets and providing a stable initialization for reinforcement learning (RL). We further propose Constrained and Hierarchical Ratio Policy Optimization (CHRPO), which uses a hierarchical reward to explicitly incentivize question-solving ability under lower budgets. Experiments on three mathematical reasoning benchmarks show the superiority of Extra-CoT. For example, on MATH-500 using Qwen3-1.7B, Extra-CoT achieves over 73% token reduction with an accuracy improvement of 0.6%, significantly outperforming state-of-the-art (SOTA) methods.