Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
Problem Statement
Traditional knowledge distillation requires a separate, often larger teacher model and suffers from distribution mismatch in off-policy settings. On-policy distillation addresses distribution mismatch but still depends on external teacher models and fails to leverage ground-truth solutions already present in reasoning datasets. This creates unnecessary computational overhead and architectural complexity for improving LLM reasoning capabilities.
Key Novelty
- Single-model teacher-student framework where one LLM plays both roles by conditioning on different contexts (privileged vs. standard), eliminating the need for a separate teacher model
- Integration of verified ground-truth reasoning traces as privileged information for the teacher policy, explicitly leveraging supervision signals already available in reasoning datasets
- Per-token KL divergence minimization over the student's own on-policy rollouts, combining the distribution-matching benefits of on-policy training with self-supervised rationalization
Evaluation Highlights
- Achieves 4-8x higher token efficiency than reinforcement learning methods such as GRPO on mathematical reasoning benchmarks
- Superior performance over off-policy distillation methods across multiple mathematical reasoning benchmarks
Methodology
- Construct teacher policy by conditioning the same base LLM on privileged information (verified ground-truth reasoning traces concatenated with the question), enabling it to produce high-quality token distributions
- Run student policy (conditioned only on the question) to generate on-policy rollouts, ensuring the training distribution matches the inference distribution and avoiding off-policy mismatch
- Minimize per-token KL divergence between the teacher's conditional distribution and the student's distribution over the student's own sampled trajectories, updating model weights to transfer privileged reasoning knowledge
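The per-token objective in the last step can be written compactly. The notation below is chosen here, since the summary does not give the paper's symbols, and both the KL direction and the stop-gradient on the teacher side are assumptions:

$$
\mathcal{L}(\theta) \;=\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \left[ \sum_{t=1}^{|y|} D_{\mathrm{KL}}\!\Big( \pi_{\bar{\theta}}(\cdot \mid x, z, y_{<t}) \;\Big\|\; \pi_\theta(\cdot \mid x, y_{<t}) \Big) \right]
$$

where $x$ is the question, $z$ the verified ground-truth reasoning trace, $y_{<t}$ the student's sampled prefix, and $\bar{\theta}$ a stop-gradient copy of the same parameters acting as the teacher.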
System Components
- Teacher policy: The LLM conditioned on both the question and privileged information (verified reasoning traces), producing token-level distributions that reflect high-quality reasoning informed by ground-truth solutions
- Student policy: The same LLM conditioned only on the question (no privileged context), mirroring inference-time conditions and generating on-policy rollouts for training
- On-policy rollout generation: Trajectory sampling from the student policy, keeping the training distribution aligned with inference and preventing the distribution-mismatch issues of off-policy distillation
- Per-token KL objective: Token-level divergence minimization between teacher and student distributions, computed over student-sampled trajectories, providing a dense supervision signal for efficient learning
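The components above can be sketched end to end in one update step. Everything concrete below is invented for illustration: the four-token vocabulary, the deterministic toy scorer standing in for an LLM forward pass, and the shared logit biases standing in for model parameters. Treating the teacher as a frozen snapshot (stop-gradient) is likewise an assumption about the method, not something the summary specifies.

```python
import math
import random

VOCAB = ["a", "b", "c", "</s>"]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def lm_probs(params, context):
    """One shared 'model': next-token distribution for a context string.
    Teacher and student differ only in what goes into `context`.
    A deterministic character-sum scorer stands in for real logits."""
    base = [sum(ord(c) for c in context + tok) % 7 / 7.0 for tok in VOCAB]
    return softmax([b + p for b, p in zip(base, params)])

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def opsd_step(params, question, trace, lr=0.1, max_len=6, seed=0):
    """One OPSD update: roll out the student (question-only context),
    then take a gradient step on the shared logit biases to reduce the
    summed per-token KL(teacher || student). The teacher (question +
    privileged trace) is a frozen snapshot, mirroring stop-gradient.
    Returns (kl_before, kl_after) measured on the same rollout."""
    rng = random.Random(seed)
    frozen = list(params)            # stop-grad teacher parameters
    positions, rollout = [], []
    for _ in range(max_len):
        prefix = "".join(rollout)
        student = lm_probs(params, question + prefix)
        teacher = lm_probs(frozen, question + trace + prefix)
        positions.append((teacher, question + prefix))
        tok = rng.choices(range(len(VOCAB)), weights=student)[0]
        if VOCAB[tok] == "</s>":
            break
        rollout.append(VOCAB[tok])
    kl_before = sum(kl(t, lm_probs(params, ctx)) for t, ctx in positions)
    # dKL/dlogit = student - teacher, accumulated over rollout positions
    grad = [0.0] * len(VOCAB)
    for teacher, ctx in positions:
        student = lm_probs(params, ctx)
        for i in range(len(VOCAB)):
            grad[i] += student[i] - teacher[i]
    for i in range(len(VOCAB)):
        params[i] -= lr * grad[i]
    kl_after = sum(kl(t, lm_probs(params, ctx)) for t, ctx in positions)
    return kl_before, kl_after
```

Because the rollout is sampled from the student's own distribution, the KL is measured exactly where the student will operate at inference time; the privileged trace enters only through the teacher's context, never the student's.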
Results
| Metric/Benchmark | Baseline | This Paper | Delta |
|---|---|---|---|
| Token efficiency vs. GRPO | 1x (GRPO baseline) | 4-8x more token-efficient | 4-8x fewer training tokens |
| Math reasoning vs. off-policy distillation | Off-policy baseline | Superior performance | Consistent improvement across benchmarks |
| Math reasoning vs. GRPO (RL) | GRPO baseline | Comparable or better | Better efficiency/accuracy tradeoff |
| Teacher model requirement | Separate, larger teacher model | Single self-distilled model | Eliminates the external teacher |
Key Takeaways
- Practitioners can train stronger reasoning models without maintaining a separate larger teacher LLM — the same model serves both roles by varying its conditioning context, reducing infrastructure complexity and cost
- Ground-truth solutions in existing reasoning datasets can be repurposed as privileged teacher context rather than just answer labels, unlocking denser supervision from already-available annotations
- For compute-constrained settings, OPSD's 4-8x token efficiency over GRPO makes it a compelling alternative to RL-based reasoning fine-tuning, achieving competitive performance with significantly less training compute
Abstract
Knowledge distillation improves large language model (LLM) reasoning by compressing the knowledge of a teacher LLM to train smaller LLMs. On-policy distillation advances this approach by having the student sample its own trajectories while a teacher LLM provides dense token-level supervision, addressing the distribution mismatch between training and inference in off-policy distillation methods. However, on-policy distillation typically requires a separate, often larger, teacher LLM and does not explicitly leverage ground-truth solutions available in reasoning datasets. Inspired by the intuition that a sufficiently capable LLM can rationalize external privileged reasoning traces and teach its weaker self (i.e., the version without access to privileged information), we introduce On-Policy Self-Distillation (OPSD), a framework where a single model acts as both teacher and student by conditioning on different contexts. The teacher policy conditions on privileged information (e.g., verified reasoning traces) while the student policy sees only the question; training minimizes the per-token divergence between these distributions over the student's own rollouts. We demonstrate the efficacy of our method on multiple mathematical reasoning benchmarks, achieving 4-8x token efficiency compared to reinforcement learning methods such as GRPO and superior performance over off-policy distillation methods.