Correct, Concise and Complete: Multi-stage Training For Adaptive Reasoning
Problem Statement
Large language models increasingly rely on chain-of-thought (CoT) reasoning at test time, but CoT traces frequently grow unnecessarily long, wasting compute without accuracy gains, a phenomenon known as "overthinking". Existing efficient reasoning methods either sacrifice significant accuracy or require complex architectures, leaving a gap for simple yet effective approaches. This work addresses the accuracy-vs-length trade-off holistically across diverse benchmarks.
Key Novelty
- A multi-stage training framework that combines an SFT stage (rejection sampling or reasoning trace reformatting) with a subsequent RL stage using an adaptive length penalty tailored to reasoning efficiency
- A lightweight reward function that penalizes tokens generated after the first correct answer while selectively rewarding self-verification only when it improves outcomes (see the sketch after this list)
- A new evaluation metric, AUC_OAA (Area Under the Overthinking-Adjusted Accuracy curve), enabling holistic assessment of the accuracy-response length trade-off across diverse tasks
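The reward shape is simple enough to express in a few lines. Below is a minimal Python sketch of such a length-penalized reward; the argument names (`first_correct_index`, `verification_improved_answer`) and the penalty/bonus weights are illustrative assumptions, not the paper's exact implementation.

```python
def length_penalized_reward(
    is_correct: bool,
    num_response_tokens: int,
    first_correct_index: int | None,
    verification_improved_answer: bool,
    penalty_per_token: float = 0.001,   # assumed weight, not from the paper
    verification_bonus: float = 0.1,    # assumed weight, not from the paper
) -> float:
    """Correctness reward minus a penalty on tokens emitted after the first
    correct answer, plus a bonus only when self-verification actually helped."""
    reward = 1.0 if is_correct else 0.0

    # Adaptive length penalty: only tokens *after* the first correct answer count.
    if first_correct_index is not None:
        extra_tokens = max(0, num_response_tokens - first_correct_index)
        reward -= penalty_per_token * extra_tokens

    # Selective self-verification reward: no blanket penalty on verification.
    if verification_improved_answer:
        reward += verification_bonus

    return reward
```

Because the penalty starts only at the first correct answer, short correct traces are left untouched, while long post-answer continuation is discouraged.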
Evaluation Highlights
- Response length reduced by 28% for 8B models and 40% for 32B models, with only 1.6- and 2.5-point accuracy drops, respectively, across seven diverse reasoning benchmarks
- AUC_OAA score of 76.6—5 points above the base model and 2.5 points above the second-best competing efficient reasoning approach
Breakthrough Assessment
Methodology
- Stage 1 (SFT): Fine-tune the base LLM using either rejection sampling (selecting short correct traces) or reasoning trace reformatting (truncating/restructuring existing traces) to instill a preference for concise reasoning (see the data-construction sketch after this list)
- Stage 2 (RL): Apply reinforcement learning with an adaptive length penalty reward that penalizes tokens generated beyond the first correct answer, while optionally rewarding self-verification steps only when they demonstrably improve answer quality
- Stage 3 (Evaluation): Assess the trained model holistically across seven reasoning tasks using both standard accuracy and the AUC_OAA metric to quantify the accuracy-vs-length Pareto trade-off
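As referenced in Stage 1, the SFT data construction can be sketched as follows. This is an illustrative Python sketch under assumed helper names (`generate_candidates`, `is_correct`) and an assumed `Final answer:` marker; the paper's exact filtering and reformatting rules may differ.

```python
ANSWER_MARKER = "Final answer:"  # assumed answer format, not taken from the paper

def rejection_sample(problem, reference_answer, generate_candidates, is_correct, k=8):
    """Sample k reasoning traces and keep the shortest correct one (if any)."""
    candidates = generate_candidates(problem, n=k)
    correct = [c for c in candidates if is_correct(c, reference_answer)]
    return min(correct, key=len) if correct else None

def reformat_trace(trace: str) -> str:
    """Truncate a trace right after its first final-answer line, dropping
    redundant continuation such as repeated re-derivations."""
    start = trace.find(ANSWER_MARKER)
    if start == -1:
        return trace
    end_of_line = trace.find("\n", start)
    return trace if end_of_line == -1 else trace[: end_of_line + 1]
```

Either routine yields short, correct traces as supervised targets, giving the RL stage a model that already prefers concise reasoning.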
System Components
- Rejection sampling (SFT): Filters training traces to retain only short, correct reasoning paths, teaching the model that concise solutions are preferred
- Reasoning trace reformatting (SFT): Restructures or truncates existing reasoning traces to remove redundant steps, creating a cleaner supervised signal for brevity
- Adaptive length penalty (RL reward): Penalizes tokens generated after the first correct answer is reached, discouraging unnecessary continuation
- Selective self-verification reward (RL reward): Encourages self-verification steps only when they have been shown to improve final answer correctness, avoiding blanket penalization of all verification
- AUC_OAA: Area Under the Overthinking-Adjusted Accuracy curve, a new holistic evaluation metric that jointly captures accuracy and response length efficiency across varying length budgets (see the sketch after this list)
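To make the metric concrete, here is a minimal sketch of an AUC_OAA-style computation. It assumes that overthinking-adjusted accuracy at a given token budget counts a response as correct only if its first correct answer falls within that budget; the paper's exact curve definition may differ.

```python
def auc_oaa(first_correct_lengths, budgets):
    """first_correct_lengths: token position of the first correct answer per
    example, or None if the response never reaches a correct answer.
    budgets: increasing list of token budgets to sweep."""
    accuracies = []
    for budget in budgets:
        hits = [pos is not None and pos <= budget for pos in first_correct_lengths]
        accuracies.append(sum(hits) / len(hits))

    # Normalized area under the accuracy-vs-budget curve (trapezoidal rule).
    area = sum(
        0.5 * (accuracies[i] + accuracies[i - 1]) * (budgets[i] - budgets[i - 1])
        for i in range(1, len(budgets))
    )
    return area / (budgets[-1] - budgets[0])

# Example: first correct answers at 120 and 480 tokens; one response never correct.
print(auc_oaa([120, 480, None], budgets=list(range(0, 1025, 128))))
```

Under this reading, a model that reaches correct answers in fewer tokens lifts the whole curve, so the area rewards both accuracy and brevity rather than a single operating point.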
Results
| Metric | Baseline | This Paper | Delta |
|---|---|---|---|
| Response length (8B models) | base-model length | 28% shorter | -28% |
| Response length (32B models) | base-model length | 40% shorter | -40% |
| Accuracy (8B models) | base-model accuracy | 1.6 points lower | -1.6 pts |
| Accuracy (32B models) | base-model accuracy | 2.5 points lower | -2.5 pts |
| AUC_OAA (vs. base model) | 71.6 | 76.6 | +5.0 pts |
| AUC_OAA (vs. second-best method) | 74.1 | 76.6 | +2.5 pts |
Key Takeaways
- Combining a concise-reasoning SFT warm-up with RL length penalties is a practical recipe for cutting response length, and hence inference compute, by 28-40% with minimal accuracy loss, making it attractive for cost-sensitive production LLM systems
- Rewarding self-verification selectively (only when beneficial) rather than penalizing all extended reasoning avoids over-pruning and preserves accuracy on problems that genuinely benefit from verification steps
- The AUC_OAA metric is a useful tool for practitioners evaluating efficient reasoning methods, as it captures the full accuracy-vs-length trade-off rather than a single operating point, enabling fairer comparison across methods
Abstract
The reasoning capabilities of large language models (LLMs) have improved substantially through increased test-time computation, typically in the form of intermediate tokens known as chain-of-thought (CoT). However, CoT often becomes unnecessarily long, increasing computation cost without actual accuracy gains, or sometimes even degrading performance, a phenomenon known as ``overthinking''. We propose a multi-stage efficient reasoning method that combines supervised fine-tuning -- via rejection sampling or reasoning trace reformatting -- with reinforcement learning using an adaptive length penalty. We introduce a lightweight reward function that penalizes tokens generated after the first correct answer while encouraging self-verification only when beneficial. We conduct a holistic evaluation across seven diverse reasoning tasks, analyzing the accuracy-response length trade-off. Our approach reduces response length by an average of 28\% for 8B models and 40\% for 32B models, while incurring only minor performance drops of 1.6 and 2.5 points, respectively. Despite its conceptual simplicity, it achieves a superior trade-off compared to more complex state-of-the-art efficient reasoning methods, scoring 76.6 in terms of the area under the Overthinking-Adjusted Accuracy curve ($\text{AUC}_{\text{OAA}}$) -- 5 points above the base model and 2.5 points above the second-best approach.