
ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs

Jiaru Zou, Ling Yang, Jing Gu, Jiahao Qiu, Ke Shen, Jingrui He, Mengdi Wang
arXiv.org | 2025
ReasonFlux-PRM is a trajectory-aware Process Reward Model that explicitly evaluates both step-level and trajectory-level reasoning traces generated by frontier reasoning models, enabling robust supervision across offline data selection, reinforcement learning, and test-time scaling.

Problem Statement

Existing PRMs are trained primarily on final output responses and fail to adequately evaluate intermediate 'thinking trajectories' produced by modern reasoning models like DeepSeek-R1, which generate trajectory-response style outputs. This mismatch leads to poor reward signal quality when PRMs are applied to long chain-of-thought data, limiting their utility for distillation, RL training, and inference-time search. A robust PRM that understands both the reasoning trajectory and the final response is critical as frontier models increasingly rely on extended internal reasoning.

Key Novelty

  • Trajectory-aware PRM architecture that jointly models step-level and trajectory-level supervision, explicitly designed for the trajectory-response output format of frontier reasoning models
  • Unified applicability across three distinct use cases: offline high-quality data selection for SFT distillation, dense process-level reward signals for online RL policy optimization, and Best-of-N reward-guided test-time scaling
  • A compact 1.5B parameter variant (ReasonFlux-PRM-1.5B) enabling deployment in resource-constrained and edge settings, alongside a 7B model that outperforms 72B baselines

Evaluation Highlights

  • ReasonFlux-PRM-7B achieves average gains of 12.1% on downstream SFT benchmarks (AIME, MATH500, GPQA-Diamond) over strong baselines including Qwen2.5-Math-PRM-72B and human-curated data
  • Consistent improvements of 4.5% in reinforcement learning training and 6.3% in Best-of-N test-time scaling compared to baseline PRM-guided approaches

Breakthrough Assessment

7/10. ReasonFlux-PRM addresses a timely and under-explored gap, evaluating trajectory-style reasoning outputs from frontier models, and demonstrates that a 7B model can outperform a 72B baseline PRM across multiple settings. This represents a significant practical advance in scalable process supervision for LLM reasoning.

Methodology

  1. Design a trajectory-aware PRM architecture that ingests trajectory-response formatted reasoning traces (as produced by DeepSeek-R1-style models) and assigns rewards at both the individual step level and the overall trajectory level
  2. Train ReasonFlux-PRM using structured chain-of-thought data with fine-grained annotations, incorporating both step-level correctness signals and trajectory-level quality labels to align reward assignment with the reasoning structure
  3. Adapt the trained PRM to three downstream settings: (i) offline filtering/selection of high-quality distillation data for SFT of smaller models, (ii) providing dense intermediate rewards during online RL policy optimization, and (iii) scoring candidate outputs for Best-of-N test-time compute scaling
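The joint step-level and trajectory-level supervision in steps 1-2 can be sketched as a simple reward blend. This is an illustrative aggregation only, not the paper's actual scoring function; the `combined_reward` helper and the `alpha` weighting are assumptions introduced here for clarity.

```python
# Hypothetical sketch: blend step-level PRM scores with a trajectory-level
# score. The paper's exact aggregation is not specified in this summary.

def combined_reward(step_scores, trajectory_score, alpha=0.5):
    """Blend the mean step-level reward with a trajectory-level score.

    step_scores: per-step correctness scores in [0, 1]
    trajectory_score: global quality score for the whole trajectory
    alpha: assumed weight on the trajectory-level signal
    """
    if not step_scores:
        raise ValueError("need at least one step score")
    step_avg = sum(step_scores) / len(step_scores)
    return alpha * trajectory_score + (1 - alpha) * step_avg
```

A higher `alpha` emphasizes global coherence of the full trajectory over local step correctness; tuning this trade-off is one natural design knob for a trajectory-aware PRM.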

System Components

Trajectory-Level Supervisor

Evaluates the overall quality and coherence of the full reasoning trajectory, capturing global reasoning patterns beyond individual step correctness

Step-Level Supervisor

Assigns fine-grained reward signals to each intermediate reasoning step, enabling precise identification of where reasoning goes right or wrong

SFT Data Selector

Uses PRM scores to filter and rank model-generated trajectories for high-quality distillation dataset construction, outperforming human curation and larger PRM baselines
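The selection step above amounts to ranking candidate trajectories by their PRM score and keeping the best ones for the SFT corpus. The sketch below is a minimal illustration of that filtering pattern; `select_top_k` is a hypothetical helper, not an interface from the ReasonFlux-PRM release.

```python
# Hypothetical sketch of PRM-score-based data selection for SFT
# distillation: rank trajectory-response examples and keep the top-k.

def select_top_k(examples, scores, k):
    """Return the k examples with the highest PRM scores.

    examples: list of trajectory-response training candidates
    scores: parallel list of PRM scores, one per example
    k: number of examples to keep for the distillation dataset
    """
    if len(examples) != len(scores):
        raise ValueError("examples and scores must align")
    ranked = sorted(zip(examples, scores), key=lambda pair: pair[1],
                    reverse=True)
    return [example for example, _ in ranked[:k]]
```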

RL Dense Reward Provider

Supplies intermediate process-level reward signals during reinforcement learning training, improving policy optimization over sparse outcome-only rewards
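One common way to turn a PRM into dense RL feedback is to pay out a small shaped reward at every step and add the sparse outcome reward at the end of the episode. The sketch below illustrates that shaping pattern; the `dense_rewards` helper and the `beta` scale are assumptions, and the paper's exact integration with the RL objective may differ.

```python
# Hypothetical sketch of dense reward shaping: per-step PRM scores plus
# the sparse outcome reward credited to the final step.

def dense_rewards(step_scores, outcome_reward, beta=0.1):
    """Build a per-step reward sequence for policy optimization.

    step_scores: PRM scores for each intermediate reasoning step
    outcome_reward: sparse reward for the final answer (e.g. 1.0 if correct)
    beta: assumed scale on the process-level signal
    """
    if not step_scores:
        raise ValueError("need at least one step score")
    rewards = [beta * s for s in step_scores]
    rewards[-1] += outcome_reward  # outcome reward lands on the last step
    return rewards
```

Compared with outcome-only rewards, every step now carries a learning signal, which is the advantage the summary attributes to process-level supervision during RL training.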

Best-of-N Ranker

Scores multiple sampled reasoning trajectories at test time to select the best response, enabling effective inference-time compute scaling
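Best-of-N selection reduces to scoring each sampled candidate and returning the argmax. The sketch below shows that pattern with a pluggable scoring function; `best_of_n` is a hypothetical helper, and in practice `score_fn` would be a call into the trained PRM rather than the toy stand-in used in the example.

```python
# Hypothetical sketch of reward-guided Best-of-N test-time scaling:
# score N sampled trajectory-response candidates, return the best one.

def best_of_n(candidates, score_fn):
    """Select the highest-scoring candidate among N samples.

    candidates: list of sampled trajectory-response outputs
    score_fn: callable mapping a candidate to a scalar reward
              (in practice, the trajectory-aware PRM)
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score_fn)

# Toy usage with a stand-in scorer (NOT a real PRM): prefer the
# candidate whose stand-in score is highest.
toy_scores = {"draft": 0.3, "careful": 0.9, "rushed": 0.1}
best = best_of_n(list(toy_scores), toy_scores.get)
```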

ReasonFlux-PRM-1.5B

A compact model variant designed for edge deployment and resource-constrained environments while retaining core trajectory-aware evaluation capability

Results

  • SFT Data Selection (AIME/MATH500/GPQA-Diamond avg): Qwen2.5-Math-PRM-72B and human-curated baselines vs. ReasonFlux-PRM-7B selection, +12.1%
  • Reinforcement Learning Training: baseline PRM reward signal vs. ReasonFlux-PRM-7B dense rewards, +4.5%
  • Best-of-N Test-Time Scaling: baseline PRM scoring vs. ReasonFlux-PRM-7B scoring, +6.3%

Key Takeaways

  • When working with DeepSeek-R1-style or other trajectory-response reasoning models, standard PRMs trained on final outputs are insufficient—practitioners should use trajectory-aware PRMs like ReasonFlux-PRM that understand the internal thinking structure
  • A well-designed 7B PRM can outperform a 72B baseline for data selection and reward modeling, suggesting that architecture and training data alignment matter more than raw model scale for process supervision
  • ReasonFlux-PRM's three-setting applicability (SFT data selection, RL training, test-time scaling) makes it a versatile drop-in component for LLM reasoning pipelines, and the 1.5B variant makes trajectory-aware process supervision accessible for resource-constrained deployments

Abstract

Process Reward Models (PRMs) have recently emerged as a powerful framework for supervising intermediate reasoning steps in large language models (LLMs). Previous PRMs are primarily trained on model final output responses and struggle to evaluate intermediate thinking trajectories robustly, especially in the emerging setting of trajectory-response outputs generated by frontier reasoning models like Deepseek-R1. In this work, we introduce ReasonFlux-PRM, a novel trajectory-aware PRM explicitly designed to evaluate the trajectory-response type of reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. We adapt ReasonFlux-PRM to support reward supervision under both offline and online settings, including (i) selecting high-quality model distillation data for downstream supervised fine-tuning of smaller models, (ii) providing dense process-level rewards for policy optimization during reinforcement learning, and (iii) enabling reward-guided Best-of-N test-time scaling. Empirical results on challenging downstream benchmarks such as AIME, MATH500, and GPQA-Diamond demonstrate that ReasonFlux-PRM-7B selects higher quality data than strong PRMs (e.g., Qwen2.5-Math-PRM-72B) and human-curated baselines. Furthermore, our derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling. We also release our efficient ReasonFlux-PRM-1.5B for resource-constrained applications and edge deployment. Project: https://github.com/Gen-Verse/ReasonFlux

Generated on 2026-03-02 using Claude