
Chain-of-Thought Matters: Improving Long-Context Language Models with Reasoning Path Supervision

Dawei Zhu, Xiyu Wei, Guangxiang Zhao, Wenhao Wu, Haosheng Zou, Junfeng Ran, Xun Wang, Lin Sun, Xiangzheng Zhang, Sujian Li
Conference on Empirical Methods in Natural Language Processing | 2025
Chain-of-Thought reasoning consistently improves long-context LLM performance, and a process-supervised framework (LongRePS) that teaches models to generate high-quality reasoning paths yields significant gains over outcome-only supervision in long-context tasks.

Problem Statement

Long-context LLMs struggle to reason over extensive inputs and aggregate the relevant information, yet most training signal comes from final-answer (outcome-only) supervision, which provides only sparse feedback. While CoT prompting helps multi-step reasoning, its benefits in long-context scenarios had not been systematically studied, and existing process supervision methods lack quality assessment protocols tailored to long-context challenges.

Key Novelty

  • Systematic empirical demonstration that CoT benefits generalize across diverse long-context tasks and scale with increasing context length
  • LongRePS: a process-supervised training framework using self-sampling to bootstrap reasoning paths without requiring human-annotated chain-of-thought data
  • A novel quality assessment protocol specifically designed to evaluate and filter reasoning paths in long-context scenarios, enabling higher-quality supervision signal

Evaluation Highlights

  • In-domain improvements over outcome supervision baselines: +13.6 points (LLaMA) and +3.8 points (Qwen) on MuSiQue multi-hop QA benchmark
  • Cross-domain generalization: +9.3 points (LLaMA) and +8.1 points (Qwen) average improvement across diverse long-context QA tasks

Breakthrough Assessment

6/10. LongRePS is a solid, practical contribution that extends process supervision to long-context settings with meaningful empirical gains, but the core ideas (CoT + process supervision + self-sampling) are evolutionary rather than paradigm-shifting: well-established techniques applied to a specific and important problem.

Methodology

  1. Empirical Analysis: Systematically evaluate CoT prompting across diverse long-context tasks and varying context lengths, establishing that CoT's benefits scale with context length
  2. Reasoning Path Bootstrapping: Use a self-sampling mechanism to generate candidate reasoning paths from the model itself, avoiding costly human annotation of long-context CoT data
  3. Process-Supervised Training: Apply the long-context quality assessment protocol to filter and score the bootstrapped paths, then train with process-level supervision rather than outcome-only supervision (a minimal end-to-end sketch follows this list)
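
The summary names the moving parts but not their interfaces, so here is a minimal, hypothetical sketch of how the bootstrap-filter-train loop could fit together. `Example`, `ReasoningPath`, `bootstrap_paths`, and `filter_paths` are illustrative names, not the authors' API, and the answer-match filter is a stand-in for the paper's fuller quality protocol.

```python
# Hypothetical LongRePS-style data loop (illustrative names, not the
# authors' API): sample CoT paths from the model itself, keep only the
# paths that pass a quality check, and fine-tune on what survives.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Example:
    context: str       # the long input document(s)
    question: str
    gold_answer: str

@dataclass
class ReasoningPath:
    steps: str         # the chain-of-thought text
    answer: str        # final answer parsed from the generation

def bootstrap_paths(ex: Example,
                    sample: Callable[[str], ReasoningPath],
                    n_samples: int = 8) -> List[ReasoningPath]:
    """Self-sampling: draw several candidate CoT traces for one example."""
    prompt = (f"{ex.context}\n\nQuestion: {ex.question}\n"
              "Think step by step, then give the final answer.")
    return [sample(prompt) for _ in range(n_samples)]

def filter_paths(ex: Example,
                 paths: List[ReasoningPath]) -> List[ReasoningPath]:
    """Keep paths whose final answer matches gold -- a minimal stand-in
    for the paper's long-context quality assessment protocol."""
    return [p for p in paths
            if p.answer.strip().lower() == ex.gold_answer.strip().lower()]

# Step 3 is then supervised fine-tuning on the surviving
# (prompt -> steps + answer) pairs, so the loss covers the reasoning
# process rather than only the final answer.
```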

System Components

CoT Analysis Module

Systematic investigation framework that benchmarks CoT prompting across multiple long-context task types and context lengths to quantify when and why CoT helps
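
As a rough illustration of what such a sweep might look like (the paper's actual harness is not described here), the sketch below assumes a `qa_set` of items with a `truncated_context` helper and an `answer_fn` that can prompt with or without CoT.

```python
# Hypothetical evaluation sweep: score the same QA items with and without
# CoT at several context lengths to see whether CoT's gains grow with
# length. `qa_set`, `truncated_context`, and `answer_fn` are assumptions.
from collections import defaultdict

def cot_sweep(qa_set, answer_fn, context_lengths=(4_000, 16_000, 64_000)):
    """Return accuracy keyed by (context_length, prompting_mode)."""
    hits = defaultdict(list)
    for n_tokens in context_lengths:
        for item in qa_set:
            ctx = item.truncated_context(n_tokens)  # assumed helper
            for mode in ("direct", "cot"):
                pred = answer_fn(ctx, item.question, use_cot=(mode == "cot"))
                hits[(n_tokens, mode)].append(pred == item.gold_answer)
    return {key: sum(v) / len(v) for key, v in hits.items()}
```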

Self-Sampling Mechanism

Bootstraps reasoning paths by sampling multiple candidate CoT traces from the model itself, enabling scalable collection of training data without manual annotation
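
One concrete way to realize self-sampling is stochastic decoding with multiple return sequences via Hugging Face transformers; the stack and the checkpoint name below are assumptions, not something the paper prescribes.

```python
# Self-sampling via stochastic decoding: n diverse CoT candidates from
# one prompt. The tooling and checkpoint are assumptions, not the paper's.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

def sample_cot_paths(prompt: str, n: int = 8, temperature: float = 0.8):
    """Return n independently sampled reasoning traces for one prompt."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        do_sample=True,            # stochastic decoding -> diverse paths
        temperature=temperature,
        num_return_sequences=n,    # n candidates from the same prompt
        max_new_tokens=512,
    )
    completions = out[:, inputs["input_ids"].shape[1]:]  # drop the prompt
    return tok.batch_decode(completions, skip_special_tokens=True)
```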

Long-Context Quality Assessment Protocol

A novel evaluation scheme tailored to long-context scenarios that scores reasoning path quality based on factors relevant to context aggregation and multi-step reasoning fidelity
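
The summary does not spell out the protocol's criteria, so the following is purely an assumed illustration of the kind of checks such a filter might run: answer correctness plus a cheap groundedness test that any evidence the path quotes actually occurs in the source context.

```python
# Hypothetical two-part quality check (NOT the paper's protocol):
# (1) the path's final answer must match gold, and (2) every span the
# path quotes must literally occur in the context, as a crude guard
# against hallucinated evidence.
import re

def quoted_spans(path: str) -> list[str]:
    """Collect text the reasoning path presents as quotes."""
    return re.findall(r'"([^"]+)"', path)

def passes_quality_check(path: str, final_answer: str,
                         gold_answer: str, context: str) -> bool:
    answer_ok = final_answer.strip().lower() == gold_answer.strip().lower()
    grounded = all(span in context for span in quoted_spans(path))
    return answer_ok and grounded
```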

LongRePS Training Framework

End-to-end process supervision framework that combines the bootstrapped paths and their quality scores to fine-tune LLMs on their reasoning processes, rather than on final answers alone, for improved long-context reasoning
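
A sketch of how a filtered path could be serialized into a fine-tuning target (assumed formatting, not the authors'): the completion contains the reasoning path plus the answer, so the loss supervises the process, not just the outcome.

```python
# Turn a filtered path into an SFT pair; the loss on `completion`
# covers the reasoning steps as well as the final answer.
def to_sft_example(context: str, question: str,
                   path: str, answer: str) -> dict:
    prompt = f"{context}\n\nQuestion: {question}\nThink step by step."
    completion = f"{path}\n\nFinal answer: {answer}"
    return {"prompt": prompt, "completion": completion}
```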

Results

| Benchmark / Setting | Outcome Supervision Baseline | LongRePS (This Paper) | Delta |
| --- | --- | --- | --- |
| MuSiQue (LLaMA, in-domain) | Baseline | Baseline + 13.6 pts | +13.6 points |
| MuSiQue (Qwen, in-domain) | Baseline | Baseline + 3.8 pts | +3.8 points |
| Diverse QA Tasks (LLaMA, cross-domain avg) | Baseline | Baseline + 9.3 pts | +9.3 points |
| Diverse QA Tasks (Qwen, cross-domain avg) | Baseline | Baseline + 8.1 pts | +8.1 points |

Key Takeaways

  • For long-context tasks, always incorporate CoT prompting at inference time: the benefits compound with context length, making it especially valuable for RAG, document QA, and multi-hop reasoning pipelines (a prompt sketch follows this list)
  • Process supervision (rewarding intermediate reasoning steps) substantially outperforms outcome-only supervision for long-context fine-tuning; practitioners training long-context models should invest in step-level feedback mechanisms
  • Self-sampling can bootstrap high-quality CoT training data without human annotation, making process-supervised training accessible for teams without large annotation budgets — the key is pairing it with a domain-appropriate quality filter
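
As a starting point for the first takeaway, here is one illustrative CoT prompt template for long-context QA; the wording is ours, not the paper's.

```python
# Illustrative CoT prompt for long-context QA; adapt the wording to taste.
COT_QA_TEMPLATE = """\
{context}

Question: {question}

First, quote the passages relevant to the question.
Then reason step by step over those passages.
Finally, write "Answer:" followed by a concise final answer.
"""

prompt = COT_QA_TEMPLATE.format(
    context="<long document text here>",   # placeholder input
    question="Which subsidiary reported the highest Q3 revenue?",
)
```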

Abstract

Recent advances in Large Language Models (LLMs) have highlighted the challenge of handling long-context tasks, where models need to reason over extensive input contexts to aggregate target information. While Chain-of-Thought (CoT) prompting has shown promise for multi-step reasoning, its effectiveness for long-context scenarios remains underexplored. Through systematic investigation across diverse tasks, we demonstrate that CoT's benefits generalize across most long-context scenarios and amplify with increasing context length. Motivated by this critical observation, we propose LongRePS, a process-supervised framework that teaches models to generate high-quality reasoning paths for enhanced long-context performance. Our framework incorporates a self-sampling mechanism to bootstrap reasoning paths and a novel quality assessment protocol specifically designed for long-context scenarios. Experimental results on various long-context benchmarks demonstrate the effectiveness of our approach, achieving significant improvements over outcome supervision baselines on both in-domain tasks (+13.6/+3.8 points for LLaMA/Qwen on MuSiQue) and cross-domain generalization (+9.3/+8.1 points on average across diverse QA tasks). Our code, data and trained models are made public to facilitate future research.
