Native Reasoning Models: Training Language Models to Reason on Unverifiable Data
Problem Statement
Current reasoning model training (SFT + RLVR) is bottlenecked by expensive human-annotated reasoning data and external verifiers that only work for objectively assessable domains like math and coding. This restricts reasoning capabilities to a narrow set of verifiable tasks while embedding human cognitive biases and incurring high data-collection costs. A general framework that works across unverifiable domains is critically needed for broader real-world applicability.
Key Novelty
- Latent variable framing of reasoning: treats reasoning traces as latent variables optimized to maximize ground-truth answer likelihood, enabling training without expert-written demonstrations
- Unified training objective that intrinsically rewards reasoning paths increasing model confidence in correct answers, creating a self-reinforcing feedback loop without external verifiers
- Systematic analysis of failure modes (e.g., policy collapse) in prior verifier-free methods and principled design of more robust reward aggregation functions to address them
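The latent-variable framing above can be written compactly. A sketch of the objective, with illustrative notation (q question, z reasoning trace, a* ground-truth answer, θ model parameters) not taken verbatim from the paper:

```latex
% Marginal likelihood of the ground-truth answer, treating the
% reasoning trace z as a latent variable (notation is illustrative):
\log p_\theta(a^* \mid q) \;=\; \log \sum_{z} p_\theta(z \mid q)\, p_\theta(a^* \mid q, z)

% A trace is intrinsically rewarded by how much it raises the answer's
% likelihood relative to answering directly, with no external verifier:
r(z) \;=\; \log p_\theta(a^* \mid q, z) \;-\; \log p_\theta(a^* \mid q)
```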
Evaluation Highlights
- State-of-the-art performance among verifier-free methods on Llama and Mistral model families, significantly outperforming standard SFT baselines and prior verifier-free RL approaches
- Particularly strong gains in complex reasoning domains, along with markedly higher robustness to policy collapse than existing verifier-free methods
Methodology
- Step 1 — Latent Variable Formulation: Model the reasoning chain as a latent variable z between a question q and answer a, framing training as maximizing P(a|q) marginalized over all possible reasoning paths
- Step 2 — Self-Generated Traces: Use the model itself to generate candidate reasoning traces from standard Q&A pairs, scoring them by how much they increase the likelihood of the ground-truth answer (no external verifier needed)
- Step 3 — Robust Reward Aggregation: Apply principled reward aggregation functions over sampled reasoning traces to update the policy, specifically designed to avoid policy collapse and reinforce reasoning paths that resolve model uncertainty
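The trace-scoring idea in Step 2 can be sketched in a few lines. The toy log-probability tables below stand in for an LLM's summed token log-probs; all names and values (`q1`, `z_good`, `z_bad`) are hypothetical:

```python
# Toy stand-in for an LLM's log-likelihoods; a real system would sum
# token log-probs from the model. All names/values are illustrative.
BASE_LOGP = {"q1": -3.0}                       # log P(a* | q), answering directly
TRACE_LOGP = {("q1", "z_good"): -0.5,          # log P(a* | q, z)
              ("q1", "z_bad"):  -4.0}

def intrinsic_reward(question: str, trace: str) -> float:
    """Score a self-generated trace by how much it increases the
    ground-truth answer's log-probability (no external verifier)."""
    return TRACE_LOGP[(question, trace)] - BASE_LOGP[question]

rewards = {z: intrinsic_reward("q1", z) for z in ("z_good", "z_bad")}
# z_good raises the answer's likelihood and earns a positive reward;
# z_bad lowers it and is penalized.
```

Because the reward depends only on the model's own likelihood of the given answer, any dataset of Q&A pairs suffices; no chain-of-thought labels are consumed.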
System Components
- Latent reasoning formulation: treats intermediate reasoning steps as latent variables, allowing the model to optimize reasoning quality indirectly through its effect on answer likelihood
- Intrinsic reward signal: the increase in log-probability of the ground-truth answer when conditioning on a generated reasoning trace, requiring only Q&A pairs
- Robust reward aggregation: combines rewards across multiple sampled reasoning paths in a theoretically grounded way that prevents policy collapse and degenerate solutions
- Unified training objective: a single objective that unifies supervised and reinforcement learning signals, enabling analysis of prior methods' failure modes under a common framework
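The summary does not spell out the exact aggregation function. As one plausible, purely illustrative example of collapse-resistant aggregation (not necessarily the paper's choice), a leave-one-out baseline centers each trace's reward against its peers, so the update favors relatively better traces instead of collapsing onto a single high-reward path:

```python
def loo_advantages(rewards: list[float]) -> list[float]:
    """Leave-one-out baseline: subtract the mean reward of the *other*
    sampled traces from each trace's reward. The resulting advantages
    sum to zero, so no single trace can dominate the policy update.
    (Hypothetical aggregator, shown for illustration only.)"""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]

# Three sampled traces with intrinsic rewards 2.5, -1.0, 0.0:
advs = loo_advantages([2.5, -1.0, 0.0])
```

Zero-sum centering is one standard way to keep a sampled-trace policy gradient from degenerating; the paper's "principled reward aggregation" presumably addresses the same failure mode.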
Results
| Benchmark / Comparison | Baseline | NRT (This Paper) | Delta |
|---|---|---|---|
| Complex reasoning (Llama family) | Standard SFT: moderate | State-of-the-art (verifier-free) | Significant improvement |
| Complex reasoning (Mistral family) | Standard SFT: moderate | State-of-the-art (verifier-free) | Significant improvement |
| vs. prior verifier-free RL methods | Prior SOTA | Outperforms | Positive gap |
| Policy-collapse robustness | Low (collapses) | High | Qualitatively better |
Key Takeaways
- ML practitioners can now train reasoning-capable models on any dataset of Q&A pairs: no expensive chain-of-thought annotations or domain-specific verifiers are required, dramatically lowering the barrier to reasoning in new domains
- The latent variable framing provides a principled lens to diagnose and fix failure modes (like policy collapse) in verifier-free RL training, offering a practical toolkit for more stable reasoning model development
- NRT's architecture-agnostic nature (validated on Llama and Mistral) suggests it can serve as a drop-in training framework for teams looking to add reasoning capabilities to existing LLMs without specialized infrastructure
Abstract
The prevailing paradigm for training large reasoning models--combining Supervised Fine-Tuning (SFT) with Reinforcement Learning with Verifiable Rewards (RLVR)--is fundamentally constrained by its reliance on high-quality, human-annotated reasoning data and external verifiers. This dependency incurs significant data-collection costs, risks embedding human cognitive biases, and confines the reinforcement learning stage to objectively assessable domains like mathematics and coding, leaving a wide range of unverifiable tasks beyond its scope. To overcome these limitations, we introduce NRT (Native Reasoning Training), a novel framework that cultivates complex reasoning by having the model generate its own reasoning traces using only standard question-answer pairs, thereby obviating the need for expert-written demonstrations. NRT reframes the training problem by treating the reasoning process as a latent variable. It employs a unified training objective that models reasoning as an optimization problem, intrinsically rewarding paths that increase the model's likelihood of producing the ground-truth answer. This unified perspective allows us to analyze intrinsic failure modes of prior methods, such as policy collapse, and systematically design more robust reward aggregation functions, creating a self-reinforcing feedback loop where the model learns to think in ways that resolve its own uncertainty. Empirical evaluation on Llama and Mistral model families demonstrates that NRT achieves state-of-the-art performance among verifier-free methods, significantly outperforming standard SFT baselines and prior verifier-free RL methods. Our approach yields particularly strong performance gains in complex reasoning domains and exhibits high robustness to policy collapse, offering a general, scalable path toward building more powerful and broadly applicable reasoning systems.