Correct, Concise and Complete: Multi-stage Training For Adaptive Reasoning
Problem Statement
Large language models increasingly rely on chain-of-thought (CoT) reasoning at test time, but CoT traces frequently grow unnecessarily long, wasting compute without accuracy gains, a phenomenon known as "overthinking". Existing efficient reasoning methods either sacrifice significant accuracy or require complex architectures, leaving a gap for simple yet effective approaches. This work addresses the accuracy-vs-length trade-off holistically across diverse benchmarks.
Key Novelty
- A multi-stage training framework that combines an SFT stage (rejection sampling or reasoning trace reformatting) with a subsequent RL stage using an adaptive length penalty tailored to reasoning efficiency
- A lightweight reward function that penalizes tokens generated after the first correct answer while selectively rewarding self-verification only when it improves outcomes (see the sketch after this list)
- A new evaluation metric, AUC_OAA (Area Under the Overthinking-Adjusted Accuracy curve), enabling holistic assessment of the accuracy-response length trade-off across diverse tasks
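The reward shape is simple enough to express in a few lines. Below is a minimal Python sketch of such a length-penalized reward; the argument names (`first_correct_index`, `verification_improved_answer`) and the penalty/bonus weights are illustrative assumptions, not the paper's exact implementation.

```python
def length_penalized_reward(
    is_correct: bool,
    num_response_tokens: int,
    first_correct_index: int | None,
    verification_improved_answer: bool,
    penalty_per_token: float = 0.001,   # assumed weight, not from the paper
    verification_bonus: float = 0.1,    # assumed weight, not from the paper
) -> float:
    """Correctness reward minus a penalty on tokens emitted after the first
    correct answer, plus a bonus only when self-verification actually helped."""
    reward = 1.0 if is_correct else 0.0

    # Adaptive length penalty: only tokens *after* the first correct answer count.
    if first_correct_index is not None:
        extra_tokens = max(0, num_response_tokens - first_correct_index)
        reward -= penalty_per_token * extra_tokens

    # Selective self-verification reward: no blanket penalty on verification.
    if verification_improved_answer:
        reward += verification_bonus

    return reward
```

Because the penalty starts only at the first correct answer, short correct traces are left untouched, while long post-answer continuation is discouraged.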
Evaluation Highlights
- Response length reduced by 28% for 8B models and 40% for 32B models, with only 1.6- and 2.5-point accuracy drops, respectively, across seven diverse reasoning benchmarks
- AUC_OAA score of 76.6—5 points above the base model and 2.5 points above the second-best competing efficient reasoning approach
Breakthrough Assessment
Methodology
- Stage 1 (SFT): Fine-tune the base LLM using either rejection sampling (selecting short correct traces) or reasoning trace reformatting (truncating/restructuring existing traces) to instill a preference for concise reasoning (see the data-construction sketch after this list)
- Stage 2 (RL): Apply reinforcement learning with an adaptive length penalty reward that penalizes tokens generated beyond the first correct answer, while optionally rewarding self-verification steps only when they demonstrably improve answer quality
- Stage 3 (Evaluation): Assess the trained model holistically across seven reasoning tasks using both standard accuracy and the AUC_OAA metric to quantify the accuracy-vs-length Pareto trade-off
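As referenced in Stage 1, the SFT data construction can be sketched as follows. This is an illustrative Python sketch under assumed helper names (`generate_candidates`, `is_correct`) and an assumed `Final answer:` marker; the paper's exact filtering and reformatting rules may differ.

```python
ANSWER_MARKER = "Final answer:"  # assumed answer format, not taken from the paper

def rejection_sample(problem, reference_answer, generate_candidates, is_correct, k=8):
    """Sample k reasoning traces and keep the shortest correct one (if any)."""
    candidates = generate_candidates(problem, n=k)
    correct = [c for c in candidates if is_correct(c, reference_answer)]
    return min(correct, key=len) if correct else None

def reformat_trace(trace: str) -> str:
    """Truncate a trace right after its first final-answer line, dropping
    redundant continuation such as repeated re-derivations."""
    start = trace.find(ANSWER_MARKER)
    if start == -1:
        return trace
    end_of_line = trace.find("\n", start)
    return trace if end_of_line == -1 else trace[: end_of_line + 1]
```

Either routine yields short, correct traces as supervised targets, giving the RL stage a model that already prefers concise reasoning.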
System Components
- Rejection sampling (SFT): Filters training traces to retain only short, correct reasoning paths, teaching the model that concise solutions are preferred
- Reasoning trace reformatting (SFT): Restructures or truncates existing reasoning traces to remove redundant steps, creating a cleaner supervised signal for brevity
- Adaptive length penalty (RL reward): Penalizes tokens generated after the first correct answer is reached, discouraging unnecessary continuation
- Selective self-verification reward (RL reward): Encourages self-verification steps only when they have been shown to improve final answer correctness, avoiding blanket penalization of all verification
- AUC_OAA: Area Under the Overthinking-Adjusted Accuracy curve, a new holistic evaluation metric that jointly captures accuracy and response length efficiency across varying length budgets (see the sketch after this list)
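To make the metric concrete, here is a minimal sketch of an AUC_OAA-style computation. It assumes that overthinking-adjusted accuracy at a given token budget counts a response as correct only if its first correct answer falls within that budget; the paper's exact curve definition may differ.

```python
def auc_oaa(first_correct_lengths, budgets):
    """first_correct_lengths: token position of the first correct answer per
    example, or None if the response never reaches a correct answer.
    budgets: increasing list of token budgets to sweep."""
    accuracies = []
    for budget in budgets:
        hits = [pos is not None and pos <= budget for pos in first_correct_lengths]
        accuracies.append(sum(hits) / len(hits))

    # Normalized area under the accuracy-vs-budget curve (trapezoidal rule).
    area = sum(
        0.5 * (accuracies[i] + accuracies[i - 1]) * (budgets[i] - budgets[i - 1])
        for i in range(1, len(budgets))
    )
    return area / (budgets[-1] - budgets[0])

# Example: first correct answers at 120 and 480 tokens; one response never correct.
print(auc_oaa([120, 480, None], budgets=list(range(0, 1025, 128))))
```

Under this reading, a model that reaches correct answers in fewer tokens lifts the whole curve, so the area rewards both accuracy and brevity rather than a single operating point.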
Results
| Metric | Baseline | This Paper | Delta |
|---|---|---|---|
| Response length (8B models) | base-model length | 28% shorter | -28% |
| Response length (32B models) | base-model length | 40% shorter | -40% |
| Accuracy (8B models) | base-model accuracy | 1.6 points lower | -1.6 pts |
| Accuracy (32B models) | base-model accuracy | 2.5 points lower | -2.5 pts |
| AUC_OAA (vs. base model) | 71.6 | 76.6 | +5.0 pts |
| AUC_OAA (vs. second-best method) | 74.1 | 76.6 | +2.5 pts |
Key Takeaways
- Combining a concise-reasoning SFT warm-up with RL length penalties is a practical recipe for cutting response length, and hence inference compute, by 28-40% with minimal accuracy loss, making it attractive for cost-sensitive production LLM systems
- Rewarding self-verification selectively (only when beneficial) rather than penalizing all extended reasoning avoids over-pruning and preserves accuracy on problems that genuinely benefit from verification steps
- The AUC_OAA metric is a useful tool for practitioners evaluating efficient reasoning methods, as it captures the full accuracy-vs-length trade-off rather than a single operating point, enabling fairer comparison across methods
Abstract
The reasoning capabilities of large language models (LLMs) have improved substantially through increased test-time computation, typically in the form of intermediate tokens known as chain-of-thought (CoT). However, CoT often becomes unnecessarily long, increasing computation cost without actual accuracy gains, or sometimes even degrading performance, a phenomenon known as ``overthinking''. We propose a multi-stage efficient reasoning method that combines supervised fine-tuning -- via rejection sampling or reasoning trace reformatting -- with reinforcement learning using an adaptive length penalty. We introduce a lightweight reward function that penalizes tokens generated after the first correct answer while encouraging self-verification only when beneficial. We conduct a holistic evaluation across seven diverse reasoning tasks, analyzing the accuracy-response length trade-off. Our approach reduces response length by an average of 28\% for 8B models and 40\% for 32B models, while incurring only minor performance drops of 1.6 and 2.5 points, respectively. Despite its conceptual simplicity, it achieves a superior trade-off compared to more complex state-of-the-art efficient reasoning methods, scoring 76.6 in terms of the area under the Overthinking-Adjusted Accuracy curve ($\text{AUC}_{\text{OAA}}$) -- 5 points above the base model and 2.5 points above the second-best approach.