
Native Reasoning Models: Training Language Models to Reason on Unverifiable Data

Yuanfu Wang, Zhixuan Liu, Xiangtian Li, Chaochao Lu, Chao Yang
2026
NRT (Native Reasoning Training) trains language models to reason on unverifiable data by treating reasoning as a latent variable and using only standard question-answer pairs, eliminating the need for expert-written reasoning traces or external verifiers.

Problem Statement

Current reasoning-model training (SFT + RLVR) is bottlenecked by expensive human-annotated reasoning data and by external verifiers that only work in objectively assessable domains such as math and coding. This restricts reasoning capabilities to a narrow set of verifiable tasks while embedding human cognitive biases and incurring high data-collection costs. A general framework that extends reasoning training to unverifiable domains is needed for broader real-world applicability.

Key Novelty

  • Latent variable framing of reasoning: treats reasoning traces as latent variables optimized to maximize ground-truth answer likelihood, enabling training without expert-written demonstrations
  • Unified training objective that intrinsically rewards reasoning paths increasing model confidence in correct answers, creating a self-reinforcing feedback loop without external verifiers
  • Systematic analysis of failure modes (e.g., policy collapse) in prior verifier-free methods and principled design of more robust reward aggregation functions to address them

Evaluation Highlights

  • State-of-the-art performance among verifier-free methods on Llama and Mistral model families, significantly outperforming standard SFT baselines and prior verifier-free RL approaches
  • Particularly strong gains in complex reasoning domains with high robustness to policy collapse compared to existing verifier-free methods

Breakthrough Assessment

7/10. NRT addresses a fundamental bottleneck in reasoning model training by removing the dependency on verifiers and expert annotations, potentially unlocking reasoning capabilities across a much broader range of tasks. While the latent variable framing is elegant and practically impactful, the approach builds on existing SFT/RL foundations, and its real-world breadth still needs validation beyond standard benchmarks.

Methodology

  1. Latent Variable Formulation: Model the reasoning chain as a latent variable z between a question q and an answer a, framing training as maximizing P(a|q) marginalized over all possible reasoning paths
  2. Self-Generated Traces: Use the model itself to generate candidate reasoning traces from standard Q&A pairs, scoring each trace by how much it increases the likelihood of the ground-truth answer (no external verifier needed)
  3. Robust Reward Aggregation: Apply principled reward aggregation functions over the sampled reasoning traces to update the policy, specifically designed to avoid policy collapse and to reinforce reasoning paths that resolve model uncertainty
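The three steps above can be sketched as a single update signal. The code below is a minimal illustration, not the paper's implementation: the function names, the toy log-probabilities, and the softmax weighting are assumptions standing in for the model's actual forward passes and NRT's actual aggregation.

```python
import math

def trace_reward(logp_with_trace: float, logp_base: float) -> float:
    # Intrinsic, verifier-free reward: how much a reasoning trace z
    # raises the log-likelihood of the ground-truth answer a.
    return logp_with_trace - logp_base

def nrt_update_weights(trace_logps, logp_base, temperature=1.0):
    # Score each sampled trace, then softmax the rewards so every
    # trace keeps nonzero weight -- a simple guard against the
    # policy collapsing onto a single path.
    rewards = [trace_reward(lp, logp_base) for lp in trace_logps]
    m = max(rewards)
    exps = [math.exp((r - m) / temperature) for r in rewards]
    total = sum(exps)
    return rewards, [e / total for e in exps]

# Toy numbers: log p(a|q) = -2.0 without a trace; three sampled
# traces give log p(a|q,z) of -0.5, -1.8, and -3.0.
rewards, weights = nrt_update_weights([-0.5, -1.8, -3.0], -2.0)
```

In a real trainer the returned weights would scale each trace's policy-gradient contribution; traces that raise answer likelihood the most get the largest (but never all) of the update.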

System Components

Latent Reasoning Variable Model

Treats intermediate reasoning steps as latent variables, allowing the model to optimize reasoning quality indirectly through its effect on answer likelihood
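In symbols (notation assumed here, not taken from the paper), this view marginalizes the answer likelihood over reasoning traces, and Jensen's inequality gives the lower bound that self-generated traces can optimize:

```latex
p_\theta(a \mid q) = \sum_{z} p_\theta(z \mid q)\, p_\theta(a \mid q, z),
\qquad
\log p_\theta(a \mid q) \;\ge\; \mathbb{E}_{z \sim p_\theta(z \mid q)}\big[\log p_\theta(a \mid q, z)\big].
```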

Verifier-Free Reward Signal

Intrinsic reward computed as the increase in log-probability of the ground-truth answer when conditioning on a generated reasoning trace, requiring only Q&A pairs

Robust Reward Aggregation Function

Aggregates rewards across multiple sampled reasoning paths in a way that is theoretically grounded to prevent policy collapse and degenerate solutions
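To make the collapse risk concrete, here is a hypothetical comparison of two aggregators over sampled-trace rewards: a hard max (which, used as a training signal, reinforces only one path and invites collapse) versus a log-mean-exp smoothing. The smoothed form is one plausible "robust" choice for illustration; the paper's actual aggregation functions may differ.

```python
import math

def max_aggregate(rewards):
    # Hard max: the training signal flows to a single best trace,
    # so the policy can collapse onto one reasoning path.
    return max(rewards)

def log_mean_exp(rewards, tau=1.0):
    # Smoothed aggregate: interpolates between the mean (tau -> inf)
    # and the max (tau -> 0), keeping signal on every sampled trace.
    m = max(rewards)
    return m + tau * math.log(
        sum(math.exp((r - m) / tau) for r in rewards) / len(rewards)
    )

rewards = [1.5, 0.2, -1.0]
hard = max_aggregate(rewards)   # 1.5
soft = log_mean_exp(rewards)    # strictly between the mean and the max
```

The smoothed value always lies between the mean and the max of the rewards, so no single trace dominates the aggregate while high-reward traces are still preferred.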

Unified SFT+RL Objective

A single training objective that unifies supervised and reinforcement learning signals, enabling analysis of prior methods' failure modes under a common framework

Results

| Benchmark/Setting | Baseline | NRT (This Paper) | Delta |
| --- | --- | --- | --- |
| Complex Reasoning (Llama family) | Moderate (standard SFT) | State-of-the-art (verifier-free) | Significant improvement |
| Complex Reasoning (Mistral family) | Moderate (standard SFT) | State-of-the-art (verifier-free) | Significant improvement |
| vs. prior verifier-free RL methods | Prior SOTA | Outperforms | Positive gap |
| Policy Collapse Robustness | Low (collapses) | High robustness | Qualitatively better |

Key Takeaways

  • ML practitioners can now train reasoning-capable models on any dataset with Q&A pairs — no need to collect expensive chain-of-thought annotations or build domain-specific verifiers, dramatically lowering the barrier for reasoning in new domains
  • The latent variable framing provides a principled lens to diagnose and fix failure modes (like policy collapse) in verifier-free RL training, offering a practical toolkit for more stable reasoning model development
  • NRT's architecture-agnostic nature (validated on Llama and Mistral) suggests it can serve as a drop-in training framework for teams looking to add reasoning capabilities to existing LLMs without specialized infrastructure

Abstract

The prevailing paradigm for training large reasoning models--combining Supervised Fine-Tuning (SFT) with Reinforcement Learning with Verifiable Rewards (RLVR)--is fundamentally constrained by its reliance on high-quality, human-annotated reasoning data and external verifiers. This dependency incurs significant data-collection costs, risks embedding human cognitive biases, and confines the reinforcement learning stage to objectively assessable domains like mathematics and coding, leaving a wide range of unverifiable tasks beyond its scope. To overcome these limitations, we introduce NRT (Native Reasoning Training), a novel framework that cultivates complex reasoning by having the model generate its own reasoning traces using only standard question-answer pairs, thereby obviating the need for expert-written demonstrations. NRT reframes the training problem by treating the reasoning process as a latent variable. It employs a unified training objective that models reasoning as an optimization problem, intrinsically rewarding paths that increase the model's likelihood of producing the ground-truth answer. This unified perspective allows us to analyze intrinsic failure modes of prior methods, such as policy collapse, and systematically design more robust reward aggregation functions, creating a self-reinforcing feedback loop where the model learns to think in ways that resolve its own uncertainty. Empirical evaluation on Llama and Mistral model families demonstrates that NRT achieves state-of-the-art performance among verifier-free methods, significantly outperforming standard SFT baselines and prior verifier-free RL methods. Our approach yields particularly strong performance gains in complex reasoning domains and exhibits high robustness to policy collapse, offering a general, scalable path toward building more powerful and broadly applicable reasoning systems.

Generated on 2026-03-02 using Claude