Chain-of-Thought Matters: Improving Long-Context Language Models with Reasoning Path Supervision
Problem Statement
Long-context LLMs struggle to reason over extensive inputs and aggregate the relevant information, yet most training signal comes from final-answer (outcome-only) supervision, which provides only sparse feedback. While CoT prompting helps multi-step reasoning, its benefits in long-context scenarios have been underexplored, and existing process supervision methods lack quality assessment protocols tailored to long-context challenges.
Key Novelty
- Systematic empirical demonstration that CoT benefits generalize across diverse long-context tasks and scale with increasing context length
- LongRePS: a process-supervised training framework using self-sampling to bootstrap reasoning paths without requiring human-annotated chain-of-thought data
- A novel quality assessment protocol specifically designed to evaluate and filter reasoning paths in long-context scenarios, enabling higher-quality supervision signal
Evaluation Highlights
- In-domain improvements over outcome supervision baselines: +13.6 points (LLaMA) and +3.8 points (Qwen) on the MuSiQue multi-hop QA benchmark
- Cross-domain generalization: +9.3 points (LLaMA) and +8.1 points (Qwen) average improvement across diverse long-context QA tasks
Methodology
- Step 1 - Empirical Analysis: Systematically evaluate CoT prompting across diverse long-context tasks and varying context lengths to establish that CoT benefits scale with context length
- Step 2 - Reasoning Path Bootstrapping: Use a self-sampling mechanism to generate candidate reasoning paths from the model itself, avoiding costly human annotation of long-context CoT data
- Step 3 - Process-Supervised Training: Apply the long-context-specific quality assessment protocol to filter and score bootstrapped reasoning paths, then train models with process-level supervision rather than outcome-only supervision (a minimal sketch of this sample-filter-train loop follows)
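A minimal sketch of Steps 2-3 under assumed interfaces: `sample_cot` (draws one CoT trace from the model), `passes_quality_check` (the long-context quality filter), and `fine_tune` are hypothetical names for illustration, not the paper's released API.

```python
from typing import Callable

def bootstrap_reasoning_paths(
    model,
    examples: list[dict],            # each: {"context": ..., "question": ..., "answer": ...}
    sample_cot: Callable,            # (model, context, question) -> reasoning trace (str)
    passes_quality_check: Callable,  # (trace, example) -> bool, long-context-aware filter
    num_samples: int = 8,
) -> list[dict]:
    """Self-sample candidate reasoning paths and keep only those passing the filter."""
    accepted = []
    for ex in examples:
        for _ in range(num_samples):
            trace = sample_cot(model, ex["context"], ex["question"])
            if passes_quality_check(trace, ex):
                accepted.append({
                    "context": ex["context"],
                    "question": ex["question"],
                    "reasoning_path": trace,
                })
                break  # keep the first accepted path per example
    return accepted

# The accepted paths then serve as supervision targets, e.g.
# fine_tune(model, bootstrap_reasoning_paths(...)) with a hypothetical trainer,
# so the model is rewarded for the reasoning path rather than the answer alone.
```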
System Components
- Empirical analysis framework: benchmarks CoT prompting across multiple long-context task types and context lengths to quantify when and why CoT helps
- Self-sampling bootstrapper: samples multiple candidate CoT traces from the model itself, enabling scalable collection of training data without manual annotation
- Quality assessment protocol: an evaluation scheme tailored to long-context scenarios that scores reasoning path quality on factors relevant to context aggregation and multi-step reasoning fidelity (a toy example follows this list)
- Process-supervised trainer: combines bootstrapped paths and quality scores to train LLMs with step-level reward signals for improved long-context reasoning
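To make the quality assessment component concrete, here is a toy filter in the spirit described above. The specific checks (final-answer match, grounding of quoted evidence in the context, step-count bounds) are assumptions for illustration, not the paper's published criteria.

```python
import re

def passes_quality_check(trace: str, example: dict,
                         min_steps: int = 2, max_steps: int = 12) -> bool:
    """Toy filter for a bootstrapped reasoning path (illustrative criteria only)."""
    steps = [s.strip() for s in trace.split("\n") if s.strip()]
    # 1. Non-degenerate: neither a bare answer nor an unbounded ramble.
    if not (min_steps <= len(steps) <= max_steps):
        return False
    # 2. Outcome check: the final step must contain the gold answer.
    if example["answer"].lower() not in steps[-1].lower():
        return False
    # 3. Grounding check: quoted evidence must actually appear in the long context,
    #    rejecting hallucinated citations produced during context aggregation.
    for quote in re.findall(r'"([^"]{20,})"', trace):
        if quote not in example["context"]:
            return False
    return True
```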
Results
| Benchmark / Setting | Improvement over Outcome Supervision Baseline |
|---|---|
| MuSiQue (LLaMA, in-domain) | +13.6 points |
| MuSiQue (Qwen, in-domain) | +3.8 points |
| Diverse QA tasks (LLaMA, cross-domain average) | +9.3 points |
| Diverse QA tasks (Qwen, cross-domain average) | +8.1 points |
Key Takeaways
- For long-context tasks, incorporate CoT prompting at inference time: its benefits generalize across most scenarios and grow with context length, making it especially valuable for RAG, document QA, and multi-hop reasoning pipelines (see the prompt sketch after this list)
- Process supervision (rewarding intermediate reasoning steps) substantially outperforms outcome-only supervision for long-context fine-tuning; practitioners training long-context models should invest in step-level feedback mechanisms
- Self-sampling can bootstrap high-quality CoT training data without human annotation, making process-supervised training accessible for teams without large annotation budgets — the key is pairing it with a domain-appropriate quality filter
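As a concrete example of the first takeaway, eliciting CoT for long-context QA can be as simple as the prompt wrapper below; the wording is an illustrative assumption, not a template from the paper.

```python
def build_long_context_cot_prompt(context: str, question: str) -> str:
    """Wrap a long context and question in an instruction that elicits
    evidence-grounded, step-by-step reasoning before the final answer."""
    return (
        "Read the documents below, then answer the question.\n\n"
        f"Documents:\n{context}\n\n"
        f"Question: {question}\n\n"
        "First, quote the relevant evidence from the documents and reason step by step. "
        "Then give the final answer on a new line starting with 'Answer:'."
    )
```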
Abstract
Recent advances in Large Language Models (LLMs) have highlighted the challenge of handling long-context tasks, where models need to reason over extensive input contexts to aggregate target information. While Chain-of-Thought (CoT) prompting has shown promise for multi-step reasoning, its effectiveness for long-context scenarios remains underexplored. Through systematic investigation across diverse tasks, we demonstrate that CoT's benefits generalize across most long-context scenarios and amplify with increasing context length. Motivated by this critical observation, we propose LongRePS, a process-supervised framework that teaches models to generate high-quality reasoning paths for enhanced long-context performance. Our framework incorporates a self-sampling mechanism to bootstrap reasoning paths and a novel quality assessment protocol specifically designed for long-context scenarios. Experimental results on various long-context benchmarks demonstrate the effectiveness of our approach, achieving significant improvements over outcome supervision baselines on both in-domain tasks (+13.6/+3.8 points for LLaMA/Qwen on MuSiQue) and cross-domain generalization (+9.3/+8.1 points on average across diverse QA tasks). Our code, data and trained models are made public to facilitate future research.