Towards Efficient Large Language Reasoning Models via Extreme-Ratio Chain-of-Thought Compression
Problem Statement
Chain-of-Thought reasoning dramatically improves LLM performance but generates excessively long token sequences that impose high inference costs, limiting deployment efficiency. Existing CoT compression methods degrade in logical fidelity at high compression ratios, causing significant accuracy drops that make them impractical for aggressive compression scenarios. There is a need for methods that can maintain reasoning quality even under extreme token budget constraints.
Key Novelty
- A semantically-preserved compressor trained on fine-grained annotated mathematical CoT data that generates high-fidelity compressed supervision pairs for downstream fine-tuning
- Mixed-ratio Supervised Fine-Tuning (SFT) that teaches the LLM to operate across a spectrum of compression budgets, providing a stable initialization point before reinforcement learning
- Constrained and Hierarchical Ratio Policy Optimization (CHRPO), a novel RL method that uses a hierarchical reward structure to explicitly incentivize correct question-solving under increasingly aggressive token budget constraints
Evaluation Highlights
- On MATH-500 with Qwen3-1.7B, Extra-CoT achieves over 73% token reduction with a +0.6% accuracy improvement over baseline, outperforming SOTA compression methods
- Results demonstrated across three mathematical reasoning benchmarks, consistently showing superiority over existing CoT compression approaches at extreme compression ratios
Methodology
- Step 1 — Train a dedicated semantically-preserved CoT compressor on mathematical data with fine-grained annotations to generate high-fidelity compressed CoT pairs at various compression ratios
- Step 2 — Fine-tune the target LLM on these compressed pairs using mixed-ratio SFT, exposing it to a range of token budgets to build multi-ratio compression capability and a stable RL initialization
- Step 3 — Apply Constrained and Hierarchical Ratio Policy Optimization (CHRPO) via reinforcement learning, using a hierarchical reward that explicitly incentivizes correct answers under lower token budgets to push extreme-ratio compression without accuracy collapse
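The hierarchical reward in Step 3 could be sketched as follows. The tier budgets and bonus values below are illustrative assumptions, not taken from the paper; the only property carried over from the description is that a correct answer earns strictly more reward the tighter the token budget it satisfies.

```python
def hierarchical_reward(is_correct, tokens_used, budgets=(1024, 512, 256, 128)):
    """Toy hierarchical reward: a correct answer earns a bonus for every
    budget tier it fits under, so tighter traces score higher; incorrect
    answers earn nothing. Tier sizes and bonus weights are assumptions."""
    if not is_correct:
        return 0.0
    reward = 1.0  # base reward for a correct answer at any length
    # Walk tiers from loosest to tightest; each satisfied tier raises the reward.
    for tier, budget in enumerate(sorted(budgets, reverse=True), start=1):
        if tokens_used <= budget:
            reward = 1.0 + 0.5 * tier
    return reward
```

Under this shape, a correct 100-token trace (within all four tiers) scores 3.0, a correct 600-token trace (within only the loosest tier) scores 1.5, and any incorrect trace scores 0.0, which gives the policy a gradient toward aggressive compression without rewarding wrong answers.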
System Components
- Semantically-preserved CoT compressor — a dedicated model trained with fine-grained CoT annotations to compress reasoning chains while retaining logical fidelity, generating reliable supervision data for the main LLM
- Mixed-ratio SFT stage — a supervised fine-tuning stage that trains the LLM on compressed CoT pairs across multiple compression ratios simultaneously, teaching budget-aware reasoning and stabilizing RL training
- CHRPO — a reinforcement learning algorithm that uses a hierarchical reward signal to progressively incentivize accurate reasoning under lower token budgets, explicitly optimizing the trade-off between compression and correctness
- Extra-CoT pipeline — the end-to-end integration of the compressor, mixed-ratio SFT, and CHRPO to achieve extreme-ratio CoT compression with minimal accuracy loss
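The mixed-ratio SFT stage could assemble its training set roughly as below. This is a minimal sketch under stated assumptions: the `<ratio=...>` prompt tag, the per-ratio `compressed_cot` field, and the record layout are all hypothetical interfaces, since the paper summary does not specify how the budget is communicated to the model.

```python
import random

def build_mixed_ratio_sft(examples, ratios=(0.25, 0.5, 0.75), seed=0):
    """Build a mixed-ratio SFT dataset: each sample pairs a question,
    a compression-ratio instruction, and the compressed CoT produced by
    the (hypothetical) compressor at that ratio, followed by the answer.
    The tag format and field names are illustrative assumptions."""
    rng = random.Random(seed)
    dataset = []
    for ex in examples:
        r = rng.choice(ratios)  # mix ratios across the dataset
        prompt = f"<ratio={r}> {ex['question']}"
        target = ex["compressed_cot"][r] + " " + ex["answer"]
        dataset.append({"prompt": prompt, "target": target})
    return dataset
```

Sampling a ratio per example (rather than fixing one) is what exposes the model to a spectrum of budgets in a single SFT run, which is the stabilizing property the summary attributes to this stage.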
Results
| Benchmark / Model | SOTA compression baseline | Extra-CoT | Delta |
|---|---|---|---|
| MATH-500 / Qwen3-1.7B accuracy | baseline accuracy (lower) | +0.6% over baseline | +0.6% accuracy |
| MATH-500 / Qwen3-1.7B token usage | baseline token count | ~27% of baseline | >73% token reduction |
| Three mathematical reasoning benchmarks | SOTA compression methods | superior on all three | consistent improvement |
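The token-usage row reports the compressed trace at roughly 27% of the baseline length, i.e. a reduction of just over 73%. A small helper makes the accounting explicit:

```python
def token_reduction(baseline_tokens, compressed_tokens):
    """Percentage of tokens saved relative to the baseline reasoning trace."""
    return 100.0 * (1 - compressed_tokens / baseline_tokens)

# For instance, a 1000-token baseline trace compressed to 270 tokens
# is a 73% reduction, consistent with the MATH-500 figure above.
```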
Key Takeaways
- Practitioners can deploy reasoning LLMs at dramatically lower inference costs (>73% fewer tokens) without sacrificing accuracy by using a pipeline that combines a dedicated compressor, multi-ratio SFT, and constrained RL — offering a practical path to efficient on-device or API-cost-sensitive deployments
- The two-stage training strategy (SFT warm-up followed by RL fine-tuning) is key to stability at extreme compression ratios; skipping the mixed-ratio SFT initialization risks RL instability and accuracy collapse
- Fine-grained supervision from a dedicated semantic compressor is critical — using off-the-shelf summarization or naive truncation as supervision degrades logical fidelity, highlighting the importance of quality compressed training data generation
Abstract
Chain-of-Thought (CoT) reasoning successfully enhances the reasoning capabilities of Large Language Models (LLMs), yet it incurs substantial computational overhead at inference. Existing CoT compression methods often suffer a critical loss of logical fidelity at high compression ratios, resulting in significant performance degradation. To achieve high-fidelity, fast reasoning, we propose a novel EXTreme-RAtio Chain-of-Thought Compression framework, termed Extra-CoT, which aggressively reduces the token budget while preserving answer accuracy. To generate reliable, high-fidelity supervision, we first train a dedicated semantically-preserved compressor on mathematical CoT data with fine-grained annotations. An LLM is then fine-tuned on these compressed pairs via mixed-ratio supervised fine-tuning (SFT), teaching it to follow a spectrum of compression budgets and providing a stable initialization for reinforcement learning (RL). We further propose Constrained and Hierarchical Ratio Policy Optimization (CHRPO), which uses a hierarchical reward to explicitly incentivize question-solving ability under lower budgets. Experiments on three mathematical reasoning benchmarks show the superiority of Extra-CoT. For example, on MATH-500 using Qwen3-1.7B, Extra-CoT achieves over 73% token reduction with an accuracy improvement of 0.6%, significantly outperforming state-of-the-art (SOTA) methods.