Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
Problem Statement
Traditional knowledge distillation requires a separate, often larger teacher model and suffers from distribution mismatch in off-policy settings. On-policy distillation addresses distribution mismatch but still depends on external teacher models and fails to leverage ground-truth solutions already present in reasoning datasets. This creates unnecessary computational overhead and architectural complexity for improving LLM reasoning capabilities.
Key Novelty
- Single-model teacher-student framework where one LLM plays both roles by conditioning on different contexts (privileged vs. standard), eliminating the need for a separate teacher model
- Integration of verified ground-truth reasoning traces as privileged information for the teacher policy, explicitly leveraging supervision signals already available in reasoning datasets
- Per-token KL divergence minimization over the student's own on-policy rollouts, combining the distribution-matching benefits of on-policy training with self-supervised rationalization
Evaluation Highlights
- Achieves 4-8x higher token efficiency than reinforcement learning methods such as GRPO on mathematical reasoning benchmarks
- Superior performance over off-policy distillation methods across multiple mathematical reasoning benchmarks
Methodology
- Construct teacher policy by conditioning the same base LLM on privileged information (verified ground-truth reasoning traces concatenated with the question), enabling it to produce high-quality token distributions
- Run student policy (conditioned only on the question) to generate on-policy rollouts, ensuring the training distribution matches the inference distribution and avoiding off-policy mismatch
- Minimize per-token KL divergence between the teacher's conditional distribution and the student's distribution over the student's own sampled trajectories, updating model weights to transfer privileged reasoning knowledge
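The per-token objective in the last step can be written compactly. The notation below is chosen here, since the summary does not give the paper's symbols, and both the KL direction and the stop-gradient on the teacher side are assumptions:

$$
\mathcal{L}(\theta) \;=\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \left[ \sum_{t=1}^{|y|} D_{\mathrm{KL}}\!\Big( \pi_{\bar{\theta}}(\cdot \mid x, z, y_{<t}) \;\Big\|\; \pi_\theta(\cdot \mid x, y_{<t}) \Big) \right]
$$

where $x$ is the question, $z$ the verified ground-truth reasoning trace, $y_{<t}$ the student's sampled prefix, and $\bar{\theta}$ a stop-gradient copy of the same parameters acting as the teacher.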
System Components
- Teacher policy: The LLM conditioned on both the question and privileged information (verified reasoning traces), producing token-level distributions that reflect high-quality reasoning informed by ground-truth solutions
- Student policy: The same LLM conditioned only on the question (no privileged context), mirroring inference-time conditions and generating on-policy rollouts for training
- On-policy rollout generation: Trajectory sampling from the student policy, keeping the training distribution aligned with inference and preventing the distribution-mismatch issues of off-policy distillation
- Per-token KL objective: Token-level divergence minimization between teacher and student distributions, computed over student-sampled trajectories, providing a dense supervision signal for efficient learning
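The components above can be sketched end to end in one update step. Everything concrete below is invented for illustration: the four-token vocabulary, the deterministic toy scorer standing in for an LLM forward pass, and the shared logit biases standing in for model parameters. Treating the teacher as a frozen snapshot (stop-gradient) is likewise an assumption about the method, not something the summary specifies.

```python
import math
import random

VOCAB = ["a", "b", "c", "</s>"]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def lm_probs(params, context):
    """One shared 'model': next-token distribution for a context string.
    Teacher and student differ only in what goes into `context`.
    A deterministic character-sum scorer stands in for real logits."""
    base = [sum(ord(c) for c in context + tok) % 7 / 7.0 for tok in VOCAB]
    return softmax([b + p for b, p in zip(base, params)])

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def opsd_step(params, question, trace, lr=0.1, max_len=6, seed=0):
    """One OPSD update: roll out the student (question-only context),
    then take a gradient step on the shared logit biases to reduce the
    summed per-token KL(teacher || student). The teacher (question +
    privileged trace) is a frozen snapshot, mirroring stop-gradient.
    Returns (kl_before, kl_after) measured on the same rollout."""
    rng = random.Random(seed)
    frozen = list(params)            # stop-grad teacher parameters
    positions, rollout = [], []
    for _ in range(max_len):
        prefix = "".join(rollout)
        student = lm_probs(params, question + prefix)
        teacher = lm_probs(frozen, question + trace + prefix)
        positions.append((teacher, question + prefix))
        tok = rng.choices(range(len(VOCAB)), weights=student)[0]
        if VOCAB[tok] == "</s>":
            break
        rollout.append(VOCAB[tok])
    kl_before = sum(kl(t, lm_probs(params, ctx)) for t, ctx in positions)
    # dKL/dlogit = student - teacher, accumulated over rollout positions
    grad = [0.0] * len(VOCAB)
    for teacher, ctx in positions:
        student = lm_probs(params, ctx)
        for i in range(len(VOCAB)):
            grad[i] += student[i] - teacher[i]
    for i in range(len(VOCAB)):
        params[i] -= lr * grad[i]
    kl_after = sum(kl(t, lm_probs(params, ctx)) for t, ctx in positions)
    return kl_before, kl_after
```

Because the rollout is sampled from the student's own distribution, the KL is measured exactly where the student will operate at inference time; the privileged trace enters only through the teacher's context, never the student's.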
Results
| Metric/Benchmark | Baseline | This Paper | Delta |
|---|---|---|---|
| Token efficiency vs. GRPO | 1x (GRPO baseline) | 4-8x more token-efficient | 4-8x fewer training tokens |
| Math reasoning vs. off-policy distillation | Off-policy baseline | Superior performance | Consistent improvement across benchmarks |
| Math reasoning vs. GRPO (RL) | GRPO baseline | Comparable or better | Better efficiency/accuracy tradeoff |
| Teacher model requirement | Separate, larger teacher model | Single self-distilled model | Eliminates the external teacher |
Key Takeaways
- Practitioners can train stronger reasoning models without maintaining a separate larger teacher LLM — the same model serves both roles by varying its conditioning context, reducing infrastructure complexity and cost
- Ground-truth solutions in existing reasoning datasets can be repurposed as privileged teacher context rather than just answer labels, unlocking denser supervision from already-available annotations
- For compute-constrained settings, OPSD's 4-8x token efficiency over GRPO makes it a compelling alternative to RL-based reasoning fine-tuning, achieving competitive performance with significantly less training compute
Abstract
Knowledge distillation improves large language model (LLM) reasoning by compressing the knowledge of a teacher LLM to train smaller LLMs. On-policy distillation advances this approach by having the student sample its own trajectories while a teacher LLM provides dense token-level supervision, addressing the distribution mismatch between training and inference in off-policy distillation methods. However, on-policy distillation typically requires a separate, often larger, teacher LLM and does not explicitly leverage ground-truth solutions available in reasoning datasets. Inspired by the intuition that a sufficiently capable LLM can rationalize external privileged reasoning traces and teach its weaker self (i.e., the version without access to privileged information), we introduce On-Policy Self-Distillation (OPSD), a framework where a single model acts as both teacher and student by conditioning on different contexts. The teacher policy conditions on privileged information (e.g., verified reasoning traces) while the student policy sees only the question; training minimizes the per-token divergence between these distributions over the student's own rollouts. We demonstrate the efficacy of our method on multiple mathematical reasoning benchmarks, achieving 4-8x token efficiency compared to reinforcement learning methods such as GRPO and superior performance over off-policy distillation methods.