Buffer Matters: Unleashing the Power of Off-Policy Reinforcement Learning in Large Language Model Reasoning
Problem Statement
On-policy RLVR methods like GRPO waste generated experiences by discarding them after a single update pass, and suffer from reward homogeneity where easy samples dominate batches, starving the model of learning signal on hard problems. This inefficiency slows convergence and leaves difficult reasoning problems consistently unsolved. Off-policy reuse of experience is a natural remedy but requires careful importance weighting and stability guarantees to avoid policy degradation.
Key Novelty
- Off-policy RLVR framework (BAPO) that maintains a replay buffer of historically difficult and high-quality samples for LLM post-training, a largely unexplored direction in RLVR for LLMs
- Dynamic batch selection mechanism that re-evaluates stored experiences to identify and prioritize genuinely difficult samples, addressing reward homogeneity in on-policy RL
- Theoretical lower bound guarantee on policy improvement during off-policy updates, ensuring training stability despite distributional shift between buffer and current policy
Evaluation Highlights
- BAPO achieves an average 12.5% improvement over GRPO baseline across mathematics, planning, and visual reasoning benchmarks
- BAPO resolves 40.7% of problems that base models consistently fail to solve across all rollouts, demonstrating strong capability on hard samples
Methodology
- Collect rollouts during on-policy episodes and store them in a structured replay buffer, tagging samples by difficulty based on reward signals and model performance history
- At each training step, dynamically compose a batch by mixing re-evaluated historically difficult samples and high-quality stored experiences, rather than using only freshly sampled on-policy data
- Apply off-policy policy gradient updates with importance sampling corrections and enforce a theoretical lower bound on policy improvement to maintain training stability and prevent policy collapse
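The buffer-and-batching steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Rollout` fields, the solve-rate-based difficulty score, the easiest-first eviction rule, and the 50/50 replay fraction are all assumptions for the sketch.

```python
import random
from dataclasses import dataclass

@dataclass
class Rollout:
    """One stored rollout: prompt, generated response, and its verifiable reward."""
    prompt: str
    response: str
    reward: float          # verifiable reward, e.g. in [0, 1]
    solve_rate: float      # fraction of recent rollouts on this prompt that succeeded

    @property
    def difficulty(self) -> float:
        # Low solve rate -> high difficulty (assumed proxy for the paper's tagging).
        return 1.0 - self.solve_rate

class ReplayBuffer:
    """Stores rollouts tagged by difficulty; mixes hard stored samples into each batch."""

    def __init__(self, capacity: int = 10_000):
        self.capacity = capacity
        self.storage: list[Rollout] = []

    def add(self, rollout: Rollout) -> None:
        if len(self.storage) >= self.capacity:
            # Evict the easiest sample so the buffer stays informative.
            self.storage.remove(min(self.storage, key=lambda r: r.difficulty))
        self.storage.append(rollout)

    def compose_batch(self, fresh: list[Rollout], batch_size: int,
                      replay_fraction: float = 0.5) -> list[Rollout]:
        """Mix fresh on-policy rollouts with the hardest stored samples."""
        n_replay = min(int(batch_size * replay_fraction), len(self.storage))
        hardest = sorted(self.storage, key=lambda r: r.difficulty,
                         reverse=True)[:n_replay]
        batch = hardest + fresh[: batch_size - n_replay]
        random.shuffle(batch)
        return batch
```

In a full pipeline, `solve_rate` would be re-estimated against the current policy before each `compose_batch` call, so that samples the model has since mastered stop being treated as hard.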
System Components
- Replay buffer: stores historical rollout experiences annotated with difficulty and quality metadata, enabling experience reuse across training iterations
- Dynamic batch selector: re-evaluates stored samples against the current policy to identify which historical examples remain difficult or informative, and assembles training batches that prioritize hard problems
- Policy improvement lower bound: a theoretical constraint applied during off-policy updates that uses importance weighting to guarantee the policy does not degrade, compensating for the distributional shift between buffer data and the current policy
- Difficulty-diverse batch composition: a batch strategy that actively counteracts the tendency of on-policy sampling to over-represent easy, high-reward samples by enforcing difficulty diversity in training batches
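The importance-weighted update with a stability constraint can be sketched as below. The paper's exact lower-bound construction is not reproduced here; the sketch uses the standard PPO-style clipped surrogate, which bounds how far a single update can move the policy away from the behavior policy that generated the buffer data. The function name and clipping constant are illustrative assumptions.

```python
import numpy as np

def off_policy_pg_loss(logp_new: np.ndarray, logp_old: np.ndarray,
                       advantages: np.ndarray, clip_eps: float = 0.2) -> float:
    """Clipped importance-weighted surrogate loss on buffer data.

    logp_new / logp_old: log-probabilities of each stored action under the
    current policy and the behavior policy that produced the rollout.
    """
    # Importance weight pi_theta(a|s) / pi_behavior(a|s), corrected in log space.
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic minimum: ignores gains from moving the ratio outside the
    # trust region, which is what keeps stale buffer data from destabilizing
    # training.
    return float(-np.mean(np.minimum(unclipped, clipped)))
```

When `logp_new == logp_old` the ratio is 1 and the loss reduces to the plain policy-gradient surrogate; as the current policy drifts from the buffer's behavior policy, clipping caps the effective weight of each stale sample.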
Results
| Benchmark Domain | GRPO (Baseline) | BAPO (This Paper) | Delta |
|---|---|---|---|
| Mathematics, planning, and visual reasoning (average across domains) | Baseline | +12.5% avg over GRPO | +12.5% |
| Hard problem resolution (consistently failed) | 0% (base model ceiling) | 40.7% solved | +40.7pp |
Key Takeaways
- Maintaining a replay buffer of difficult and high-quality samples is a low-cost, high-impact modification to RLVR training pipelines that can significantly improve learning efficiency without changing the base model architecture
- Reward homogeneity — where easy samples dominate on-policy batches — is a concrete, addressable bottleneck in LLM RL post-training; practitioners should audit their batch composition for difficulty diversity
- Off-policy methods require importance sampling and stability guarantees to work reliably in LLM post-training; the 40.7% resolution rate on previously unsolvable problems suggests targeted hard-sample replay is especially valuable for expanding model capability boundaries
Abstract
Traditional on-policy Reinforcement Learning with Verifiable Rewards (RLVR) frameworks suffer from experience waste and reward homogeneity, which directly hinder learning efficiency on difficult samples during large language model post-training. In this paper, we introduce Batch Adaptation Policy Optimization (BAPO), an off-policy RLVR framework that improves data efficiency in large language model post-training. It dynamically selects training batches by re-evaluating historically difficult samples and reusing high-quality ones, while maintaining a lower-bound guarantee on policy improvement. Extensive experiments demonstrate that BAPO achieves an average 12.5% improvement over GRPO across mathematics, planning, and visual reasoning tasks. Crucially, BAPO successfully resolves 40.7% of problems that base models consistently fail to solve.