Making Slow Thinking Faster: Compressing LLM Chain-of-Thought via Step Entropy

Step entropy quantifies the informational contribution of individual Chain-of-Thought reasoning steps. It allows up to 80% of low-entropy steps to be pruned with minor accuracy loss, and it supports training LLMs to generate compressed CoTs autonomously using [SKIP] tokens.

Problem Statement

LLMs with Chain-of-Thought prompting generate verbose, redundant reasoning traces that inflate inference costs and reduce deployment scalability. Existing approaches lack principled metrics to identify which reasoning steps are truly redundant versus informationally critical. Blind or random pruning severely degrades reasoning performance, highlighting the need for a theoretically grounded compression strategy.

Key Novelty

Introduction of 'step entropy' as a principled information-theoretic metric to quantify the informational contribution of individual CoT reasoning steps and identify redundancy
Empirical finding that 80% of low-entropy intermediate steps can be pruned with minor accuracy loss, while random or high-entropy pruning severely impairs performance
Two-stage training strategy (SFT + GRPO reinforcement learning) that teaches LLMs to autonomously generate compressed CoTs by learning to insert [SKIP] tokens at redundant steps during inference

Evaluation Highlights

80% of low-entropy intermediate reasoning steps pruned across DeepSeek-R1-7B, DeepSeek-R1-14B, and Qwen3-8B with only minor degradation in final answer accuracy on mathematical reasoning benchmarks
Sharp contrast demonstrated between low-entropy pruning (minor accuracy loss) vs. random or high-entropy pruning (severe reasoning performance impairment), validating the entropy-based selection criterion

Signal Assessment

6/10. The step entropy metric provides a theoretically grounded and empirically validated principle for CoT compression, and the two-stage SFT+GRPO training is a practical advance for efficient inference. However, the core idea of entropy-based pruning and RL-guided token skipping is an incremental extension of existing compression and efficient reasoning literature rather than a paradigm-shifting contribution.

Methodology

Step 1 — Compute step entropy for each intermediate reasoning step in a CoT trace to measure its informational contribution; steps with low entropy are identified as highly redundant candidates for removal
Step 2 — Validate the entropy metric empirically by pruning low-entropy vs. random vs. high-entropy steps across multiple models and benchmarks, confirming that low-entropy pruning preserves accuracy while other strategies degrade performance
Step 3 — Train LLMs using a two-stage pipeline: first Supervised Fine-Tuning (SFT) on compressed CoT data with [SKIP] tokens marking pruned steps, then Group Relative Policy Optimization (GRPO) reinforcement learning to refine the model's ability to autonomously decide which steps to skip during inference

System Components

Step Entropy Metric

An information-theoretic measure that quantifies how much a single reasoning step contributes to the overall CoT; low-entropy steps are near-redundant and safe to prune
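The summary does not state the formula, so the following is only a minimal sketch under one plausible assumption: that a step's entropy is the sum of the Shannon entropies of the next-token distributions for the tokens generated within that step. The function names and the toy distributions are hypothetical.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def step_entropy(step_token_dists: Sequence[Sequence[float]]) -> float:
    """Assumed definition: sum of token entropies over the tokens in one step."""
    return sum(token_entropy(dist) for dist in step_token_dists)


# Toy 4-word vocabulary; each inner list is the model's distribution
# at one generated token position of the step (illustrative values).
confident_step = [[0.97, 0.01, 0.01, 0.01], [0.99, 0.01, 0.0, 0.0]]
uncertain_step = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]]

entropies = [step_entropy(confident_step), step_entropy(uncertain_step)]
```

Under this assumption, a step the model produces with near-certainty at every token scores close to zero, which matches the intuition that such a step adds little information.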

Entropy-Guided Pruning

A pruning strategy that selectively removes low-entropy CoT steps, achieving up to 80% step reduction with minimal accuracy degradation on math benchmarks
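The three pruning strategies compared in the evaluation (low-entropy, high-entropy, random) could be sketched as a single selection routine. This is an illustration rather than the authors' code: the function name, the `ratio` and `seed` parameters, and the rounding rule are all assumptions.

```python
import math
import random
from typing import Sequence, Set


def select_steps_to_prune(
    entropies: Sequence[float],
    ratio: float = 0.8,
    strategy: str = "low",
    seed: int = 0,
) -> Set[int]:
    """Return indices of steps to prune under the chosen strategy."""
    # Assumed rounding rule: prune floor(ratio * number_of_steps) steps.
    n_prune = math.floor(ratio * len(entropies))
    indices = list(range(len(entropies)))
    if strategy == "low":
        ranked = sorted(indices, key=lambda i: entropies[i])
    elif strategy == "high":
        ranked = sorted(indices, key=lambda i: entropies[i], reverse=True)
    elif strategy == "random":
        ranked = indices[:]
        random.Random(seed).shuffle(ranked)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return set(ranked[:n_prune])


# Illustrative per-step entropies for a five-step trace.
step_entropies = [0.1, 2.3, 0.05, 1.7, 0.4]
low_pruned = select_steps_to_prune(step_entropies, strategy="low")
high_pruned = select_steps_to_prune(step_entropies, strategy="high")
```

With five steps and a ratio of 0.8, the low-entropy strategy keeps only the single highest-entropy step, while the high-entropy strategy keeps only the lowest-entropy one.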

[SKIP] Token Mechanism

A special token inserted into training data to mark pruned reasoning steps, allowing the model to learn a compact representation of skipped content
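One way to build such training targets is to substitute the marker for each pruned step. The sketch below assumes one [SKIP] per pruned step and newline-separated steps; the summary does not say whether consecutive skips are merged or how steps are delimited, and the helper name is hypothetical.

```python
from typing import List, Set

SKIP_TOKEN = "[SKIP]"


def compress_cot(steps: List[str], pruned: Set[int], sep: str = "\n") -> str:
    """Replace every pruned step with SKIP_TOKEN and join the result."""
    return sep.join(
        SKIP_TOKEN if i in pruned else step for i, step in enumerate(steps)
    )


# Illustrative trace; the pruned indices would come from an entropy ranking.
steps = [
    "Let x be the unknown number.",
    "The equation is 3x + 2 = 11.",
    "Subtract 2 from both sides.",
    "So 3x = 9 and x = 3.",
]
compressed = compress_cot(steps, pruned={0, 2})
```

Here `compressed` is a four-line string whose first and third lines are `[SKIP]`, which could serve as a fine-tuning target paired with the original question.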

SFT Stage

Supervised fine-tuning on entropy-compressed CoT traces with [SKIP] tokens to teach the model the pattern of compressed reasoning

GRPO Stage

Group Relative Policy Optimization reinforcement learning that further refines the model's skip decisions based on final answer correctness rewards, enabling autonomous compression at inference time
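GRPO scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows that group-relative normalization with a binary correctness reward. The exact reward used in the paper is not given in this summary, so the exact-match reward, the `eps` constant, and the use of the population standard deviation are assumptions.

```python
import statistics
from typing import List, Sequence


def correctness_reward(predicted: str, gold: str) -> float:
    """Assumed reward: 1.0 if the final answer matches exactly, else 0.0."""
    return 1.0 if predicted.strip() == gold.strip() else 0.0


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical compressed-CoT samples for one prompt with answer "3".
final_answers = ["3", "4", "3", " 3 "]
rewards = [correctness_reward(a, "3") for a in final_answers]
advantages = group_relative_advantages(rewards)
```

Samples with a correct final answer receive a positive advantage and the incorrect one a negative advantage, so skip decisions that preserve correctness are reinforced. When every sample in a group earns the same reward, all advantages are zero and the group contributes no learning signal.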

Results

Setting	Baseline (full CoT)	This Paper (compressed CoT)	Delta
Low-entropy pruning (80% steps removed), DeepSeek-R1-7B	Full accuracy	Minor accuracy degradation	~80% step reduction, minimal accuracy loss
Low-entropy pruning (80% steps removed), DeepSeek-R1-14B	Full accuracy	Minor accuracy degradation	~80% step reduction, minimal accuracy loss
Low-entropy pruning (80% steps removed), Qwen3-8B	Full accuracy	Minor accuracy degradation	~80% step reduction, minimal accuracy loss
Random pruning (80% steps removed)	Full accuracy	Severe accuracy degradation	Validates entropy metric superiority
High-entropy pruning (80% steps removed)	Full accuracy	Severe accuracy degradation	Validates entropy metric superiority

Key Takeaways

Step entropy provides a cheap, principled way to audit CoT traces for redundancy before deployment; practitioners can use it to estimate compression potential without expensive retraining
The 80% pruning finding suggests that current LLMs significantly over-generate during reasoning; efficiency-focused teams should consider entropy-guided compression as a post-training or inference-time optimization step
The SFT+GRPO two-stage recipe is a reusable training template for teaching any capable base model to self-compress its reasoning, applicable beyond math to other domains with verifiable rewards

Abstract

Large Language Models (LLMs) using Chain-of-Thought (CoT) prompting excel at complex reasoning but generate verbose thought processes with considerable redundancy, leading to increased inference costs and reduced efficiency. We introduce a novel CoT compression framework based on step entropy, a metric that quantifies the informational contribution of individual reasoning steps to identify redundancy. Through theoretical analysis and extensive empirical validation on mathematical reasoning benchmarks, we demonstrate that steps with low entropy are indeed highly redundant. Our experiments reveal that an astonishing 80% of low-entropy intermediate steps can be pruned with minor degradation in the final answer accuracy across DeepSeek-R1-7B, DeepSeek-R1-14B, and Qwen3-8B. This finding sharply contrasts with random or high-entropy pruning, which severely impairs reasoning performance. Building on this, we propose a novel two-stage training strategy combining Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) reinforcement learning. This approach enables LLMs to autonomously learn to generate compressed CoTs during inference by strategically incorporating [SKIP] tokens. Our method significantly improves LLM inference efficiency while preserving accuracy, paving the way for more scalable LLM deployments and a better understanding of their internal reasoning. The code and data are available at https://github.com/staymylove/COT_Compresstion_via_Step_entropy.