
Self-Improving Pretraining: using post-trained models to pretrain better models

E. Tan, S. Dhuliawala, Jing Xu, Ping Yu, Sainbayar Sukhbaatar, J. Weston, Olga Golovneva
arXiv.org | 2026
Self-Improving Pretraining (SIP) integrates reinforcement learning directly into the pretraining loop, using a strong post-trained model as a judge to improve the quality, safety, and factuality of generated token sequences during pretraining itself.

Problem Statement

Current LLM safety and quality improvements rely on expensive post-training pipelines (fine-tuning, RLHF, alignment), but these cannot fully correct undesirable patterns baked in during pretraining. Addressing safety, factuality, and quality at the pretraining stage is crucial because core model behaviors are established then. Existing pretraining methods use static corpora without any online quality feedback, leaving models vulnerable to learning from low-quality or harmful text.

Key Novelty

  • First method to apply RL-based quality optimization directly during pretraining by streaming documents and optimizing the next K generated tokens at each training step
  • A curriculum-based hybrid reward mechanism that transitions from relying on original/rewritten suffixes early in training to rewarding model rollouts as the model improves
  • Use of a strong post-trained judge model to score candidate generations (rollouts, original suffix, rewritten suffix) for quality, safety, and factuality during pretraining

Evaluation Highlights

  • 36.2% relative improvement in factuality and 18.5% relative improvement in safety over standard pretraining baselines
  • Up to 86.3% win rate improvement in overall generation quality compared to standard pretraining

Breakthrough Assessment

8/10 Integrating RL-based quality feedback directly into pretraining is a significant architectural and methodological advance that challenges the conventional post-training-only alignment paradigm, with strong empirical gains in safety and factuality; however, computational overhead and scalability to frontier-scale models remain open questions.

Methodology

  1. Stream pretraining documents and at each step generate multiple candidate continuations (rollouts, original suffix, rewritten suffix) for the next K tokens
  2. A strong post-trained judge model scores each candidate on quality, safety, and factuality, providing RL reward signals to the model being pretrained
  3. Apply a curriculum: early in training use original and rewritten suffixes as primary learning signal; as the model improves, shift toward rewarding high-quality model rollouts via RL, progressively bootstrapping self-improvement
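The three steps above can be sketched as a single training-step function. All names and signatures here are illustrative assumptions, not the paper's actual API; toy stand-ins replace the pretrained model, the judge, and the rewriter so the sketch runs end to end, and the RL update itself is omitted.

```python
def sip_step(generate, judge_score, rewrite, prefix, original_suffix,
             k=8, n_rollouts=2):
    """One hypothetical SIP step: build candidates for the next K tokens,
    score them with the judge, and return per-candidate rewards."""
    # 1. Candidate continuations: model rollouts, the original document
    #    suffix, and a rewritten suffix.
    candidates = [generate(prefix, k) for _ in range(n_rollouts)]
    candidates.append(original_suffix[:k])
    candidates.append(rewrite(prefix, original_suffix)[:k])

    # 2. The post-trained judge scores each candidate on quality, safety,
    #    and factuality; aggregation to one scalar is an assumption here.
    rewards = [judge_score(prefix, c) for c in candidates]

    # 3. These rewards would feed an RL update (e.g. a policy-gradient
    #    step) on the model being pretrained; omitted in this sketch.
    return candidates, rewards

# Toy stand-ins so the sketch is runnable (not the paper's components).
toy_generate = lambda prefix, k: (prefix + " ...")[:k]
toy_rewrite = lambda prefix, suffix: suffix.upper()
toy_judge = lambda prefix, cand: float(len(cand)) / 10.0

cands, rs = sip_step(toy_generate, toy_judge, toy_rewrite,
                     "The capital of France", " is Paris.")
```

With two rollouts, the candidate set has four entries (rollouts, original suffix, rewritten suffix), each paired with a scalar reward.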

System Components

Token-level RL during Pretraining

Applies reinforcement learning at the next-K-token prediction level during pretraining, optimizing each document chunk for quality rather than just next-token likelihood

Post-trained Judge Model

A strong, already-aligned LLM that evaluates candidate generations for factuality, safety, and overall quality, providing scalar reward signals
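One common way to obtain a scalar reward from an LLM judge is to prompt it with a rubric and parse a numeric reply. The rubric wording and the 0-10 scale below are assumptions for illustration; the paper's actual judging prompt is not shown in this summary.

```python
def build_judge_prompt(prefix, candidate):
    # Hypothetical rubric; the paper's real judging prompt may differ.
    return (
        "Rate the continuation of the document below on quality, safety, "
        "and factuality, from 0 to 10. Reply with a single number.\n\n"
        f"Document prefix:\n{prefix}\n\n"
        f"Candidate continuation:\n{candidate}\n"
    )

def parse_scalar_reward(judge_reply, lo=0.0, hi=10.0):
    """Clamp and normalise the judge's numeric reply to [0, 1]."""
    try:
        score = float(judge_reply.strip().split()[0])
    except (ValueError, IndexError):
        return 0.0  # unparseable replies get zero reward
    return max(lo, min(hi, score)) / hi
```

Returning 0.0 for unparseable replies is one simple fallback; retrying the judge call is another.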

Rewritten Suffix Generator

Produces improved versions of original document continuations to serve as high-quality reference targets, especially useful early in training

Adaptive Curriculum

Controls the mixture of original, rewritten, and rollout-based learning signals over training time, transitioning from imitation to RL as model capability grows
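A minimal sketch of such a schedule, assuming a simple linear ramp (the paper's actual schedule shape and hyperparameters are not given in this summary): weight shifts from the original/rewritten suffixes toward RL on model rollouts as training progresses.

```python
def curriculum_weights(step, total_steps, warmup_frac=0.5):
    """Hypothetical linear curriculum: early steps lean on the original and
    rewritten suffixes (imitation); later steps lean on RL-rewarded
    model rollouts. warmup_frac sets when the transition completes."""
    progress = min(1.0, step / (warmup_frac * total_steps))
    suffix_weight = 1.0 - progress   # original / rewritten suffix signal
    rollout_weight = progress        # RL signal on model rollouts
    return suffix_weight, rollout_weight
```

For example, at step 0 all weight is on the suffix signal, at 25% of training (with `warmup_frac=0.5`) the mix is even, and from the halfway point on all weight is on RL rollouts.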

Results

| Metric/Benchmark | Standard Pretraining | Self-Improving Pretraining |
| --- | --- | --- |
| Factuality | Baseline | +36.2% relative improvement |
| Safety | Baseline | +18.5% relative improvement |
| Overall generation quality (win rate) | Baseline | Up to +86.3% win rate improvement |

Key Takeaways

  • Safety and factuality problems are better addressed at pretraining time rather than patched post-hoc; practitioners building new models from scratch should consider integrating quality feedback loops during pretraining
  • A strong post-trained judge model can serve as a scalable, reusable reward signal for pretraining, reducing the need for massive curated datasets while improving model quality
  • The curriculum approach (starting with supervised rewriting, transitioning to RL rollouts) provides a practical blueprint for stabilizing RL during the noisy, large-scale pretraining regime

Abstract

Ensuring safety, factuality and overall quality in the generations of large language models is a critical challenge, especially as these models are increasingly deployed in real-world applications. The prevailing approach to addressing these issues involves collecting expensive, carefully curated datasets and applying multiple stages of fine-tuning and alignment. However, even this complex pipeline cannot guarantee the correction of patterns learned during pretraining. Therefore, addressing these issues during pretraining is crucial, as it shapes a model's core behaviors and prevents unsafe or hallucinated outputs from becoming deeply embedded. To tackle this issue, we introduce a new pretraining method that streams documents and uses reinforcement learning (RL) to improve the next K generated tokens at each step. A strong, post-trained model judges candidate generations -- including model rollouts, the original suffix, and a rewritten suffix -- for quality, safety, and factuality. Early in training, the process relies on the original and rewritten suffixes; as the model improves, RL rewards high-quality rollouts. This approach builds higher quality, safer, and more factual models from the ground up. In experiments, our method gives 36.2% and 18.5% relative improvements over standard pretraining in terms of factuality and safety, and up to 86.3% win rate improvements in overall generation quality.

Generated on 2026-03-02 using Claude