Structured Reasoning for Large Language Models

Jinyi Han, Zixiang Di, Zishang Jiang, Ying Liao, Jiaqing Liang, Yongqi Wang, Yanghua Xiao
arXiv.org | 2026
Structured Reasoning (SCR) decomposes LLM reasoning trajectories into explicit Generate-Verify-Revise components with targeted supervision, cutting redundant reasoning steps while maintaining accuracy. The framework trains distinct reasoning abilities through a progressive two-stage reinforcement learning strategy that avoids interfering learning signals, reducing output token length by up to 50%.

Problem Statement

LLMs generating long chains of thought frequently produce redundant or ineffective reasoning steps, including unnecessary verification and revision cycles even after reaching correct answers. This inefficiency stems from unstructured reasoning trajectories that lack targeted supervision for distinct reasoning sub-abilities. Existing approaches treat reasoning as a monolithic process, making it difficult to identify and correct specific failure modes like over-verification.

Key Novelty

  • Structured Reasoning (SCR) framework that explicitly decouples reasoning trajectories into evaluable and independently trainable components (Generate, Verify, Revise)
  • Dynamic Termination Supervision that teaches models to recognize when reasoning is complete and stop early, directly targeting the over-generation problem
  • Progressive two-stage reinforcement learning strategy that trains initial generation/self-verification in stage one and revision in stage two to avoid learning signal interference between reasoning abilities

Evaluation Highlights

  • SCR reduces output token length by up to 50% compared to existing reasoning paradigms across three backbone models
  • SCR substantially improves both reasoning efficiency and self-verification accuracy, demonstrating that structured decomposition enhances quality while reducing compute cost

Breakthrough Assessment

6/10. SCR offers a well-motivated and practically impactful framework for reasoning efficiency, with a clean modular design and strong empirical results (up to 50% token reduction). However, the Generate-Verify-Revise paradigm is an established concept, and the novelty lies primarily in the training methodology rather than a fundamentally new reasoning architecture.

Methodology

  1. Decompose reasoning trajectories into three explicit, labeled components (Generate for the initial answer, Verify for the correctness check, Revise for the correction) and construct structured training data accordingly (a data-format sketch follows this list)
  2. Apply Dynamic Termination Supervision during training to provide targeted signals that teach the model to terminate reasoning when verification confirms a correct answer, avoiding unnecessary continuation
  3. Train using a progressive two-stage reinforcement learning strategy: Stage 1 optimizes initial generation and self-verification jointly, Stage 2 fine-tunes revision behavior independently to prevent cross-component learning interference
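
The paper's exact data schema is not reproduced in this summary, so the component tags and field names below are illustrative assumptions. The following minimal Python sketch shows how a trajectory decomposed into Generate, Verify, and Revise segments, with an explicit termination marker for Dynamic Termination Supervision, might be serialized for training:

```python
# A minimal sketch of structured trajectory data in the spirit of SCR.
# The tag and field names below are assumptions, not the paper's schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class StructuredTrajectory:
    question: str
    generate: str                   # initial reasoning and candidate answer
    verify: str                     # explicit self-check of the candidate
    verify_passed: bool             # label used to supervise self-verification
    revise: Optional[str] = None    # present only when verification flags an error

def to_training_text(traj: StructuredTrajectory) -> str:
    """Serialize a trajectory with explicit component tags so each reasoning
    ability can be evaluated and supervised separately."""
    parts = [
        f"<generate>{traj.generate}</generate>",
        f"<verify>{traj.verify}</verify>",
    ]
    if traj.verify_passed:
        # Dynamic Termination Supervision target: stop once verification passes.
        parts.append("<terminate/>")
    elif traj.revise is not None:
        parts.append(f"<revise>{traj.revise}</revise>")
    return traj.question + "\n" + "\n".join(parts)
```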

System Components

Generate

The initial reasoning and answer generation step, producing a candidate solution before any verification occurs

Verify

An explicit self-verification step where the model checks whether its generated answer is correct, trained as a distinct reasoning ability

Revise

A conditional correction step that is only triggered when verification identifies an error, avoiding redundant revisions on correct answers
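
As a concrete picture of how the three components compose at inference time, here is a minimal Python sketch of the conditional Generate-Verify-Revise control flow. The `generate`, `verify`, and `revise` helpers stand in for component-specific prompts to the same model; they are illustrative placeholders, not an API from the paper.

```python
# Minimal sketch of the Generate-Verify-Revise control flow at inference time.
# `model.generate/verify/revise` are assumed helpers, not the authors' interface.
def structured_reasoning(question: str, model, max_revisions: int = 1) -> str:
    answer = model.generate(question)                  # Generate: candidate solution
    for _ in range(max_revisions):
        verdict = model.verify(question, answer)       # Verify: explicit self-check
        if verdict.is_correct:
            return answer                              # terminate: no redundant revision
        answer = model.revise(question, answer, verdict.feedback)  # Revise only on error
    return answer
```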

Dynamic Termination Supervision

A targeted supervision signal that trains the model to decide when to stop the reasoning chain, preventing over-generation after correct answers are verified
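
One way to realize such a signal is reward shaping that pays a bonus for stopping once verification passes and penalizes tokens generated afterwards. The paper's exact reward is not reproduced here, so the Python sketch below is only an assumed form of this idea.

```python
# Illustrative reward shaping for termination-aware supervision (assumed form).
def termination_reward(answer_correct: bool, verified_correct: bool,
                       stopped_after_verify: bool, extra_tokens: int,
                       length_penalty: float = 1e-3) -> float:
    reward = 1.0 if answer_correct else 0.0
    if verified_correct and stopped_after_verify:
        reward += 0.5                          # bonus for stopping once verification passes
    reward -= length_penalty * extra_tokens    # discourage reasoning past the stop point
    return reward
```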

Progressive Two-Stage RL

A reinforcement learning curriculum that separates training of generation/verification (Stage 1) from revision (Stage 2) to prevent gradient interference between distinct reasoning abilities
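
A schematic of that curriculum is sketched below, assuming a generic policy-gradient-style `rl_update` and caller-supplied reward functions; all names are placeholders rather than the authors' code.

```python
# Schematic two-stage RL curriculum (a sketch under assumptions).
def train_scr(model, stage1_data, stage2_data, rl_update,
              gen_verify_reward, revise_reward):
    # Stage 1: optimize initial generation and self-verification jointly;
    # revision is not rolled out or rewarded, so its gradients cannot interfere.
    for batch in stage1_data:
        rollouts = model.rollout(batch, components=("generate", "verify"))
        rl_update(model, rollouts, reward_fn=gen_verify_reward)

    # Stage 2: fine-tune revision behavior on its own reward signal.
    for batch in stage2_data:
        rollouts = model.rollout(batch, components=("revise",))
        rl_update(model, rollouts, reward_fn=revise_reward)
```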

Results

| Metric/Benchmark | Baseline (Standard CoT) | SCR (This Paper) | Delta |
|---|---|---|---|
| Output Token Length | Baseline length | Up to 50% shorter | -50% (max) |
| Reasoning Efficiency | Standard | Substantially improved | Significant gain |
| Self-Verification Accuracy | Standard | Substantially improved | Significant gain |
| Backbone Models Tested | N/A | 3 models evaluated | Consistent gains |

Key Takeaways

  • Decomposing reasoning into discrete, trainable sub-components (Generate-Verify-Revise) is a practical strategy for reducing token waste in production LLM deployments, with up to 50% cost savings in inference compute
  • Applying separate reinforcement learning stages for distinct reasoning abilities (generation vs. revision) is critical to avoid signal interference — practitioners fine-tuning reasoning models should consider curriculum-based RL rather than joint training
  • Dynamic Termination Supervision is a transferable technique: explicitly training models to recognize 'done' states addresses a systemic over-generation problem that affects most long-chain-of-thought models and can be applied beyond this specific framework

Abstract

Large language models (LLMs) achieve strong performance by generating long chains of thought, but longer traces often introduce redundant or ineffective reasoning steps. One typical behavior is that models perform unnecessary verification and revision even after they have reached the correct answer. This limitation stems from the unstructured nature of reasoning trajectories and the lack of targeted supervision for critical reasoning abilities. To address this, we propose Structured Reasoning (SCR), a framework that decouples reasoning trajectories into explicit, evaluable, and trainable components. We implement SCR primarily through a Generate-Verify-Revise paradigm. Specifically, we construct structured training data and apply Dynamic Termination Supervision to guide the model in deciding when to terminate reasoning. To avoid interference between learning signals for different reasoning abilities, we adopt a progressive two-stage reinforcement learning strategy: the first stage targets initial generation and self-verification, and the second stage focuses on revision. Extensive experiments on three backbone models show that SCR substantially improves reasoning efficiency and self-verification. Moreover, compared with existing reasoning paradigms, it reduces output token length by up to 50%.
