
Buffer Matters: Unleashing the Power of Off-Policy Reinforcement Learning in Large Language Model Reasoning

Xu Wan, Yansheng Wang, Wenqi Huang, Mingyang Sun
2026
BAPO (Batch Adaptation Policy Optimization) is an off-policy RLVR framework that improves LLM post-training efficiency by dynamically selecting training batches from a replay buffer, re-evaluating historically difficult samples and reusing high-quality experiences with a guaranteed policy improvement lower bound.

Problem Statement

On-policy RLVR methods like GRPO waste generated experiences by discarding them after a single update pass, and suffer from reward homogeneity where easy samples dominate batches, starving the model of learning signal on hard problems. This inefficiency slows convergence and leaves difficult reasoning problems consistently unsolved. Off-policy reuse of experience is a natural remedy but requires careful importance weighting and stability guarantees to avoid policy degradation.

Key Novelty

  • Off-policy RLVR framework (BAPO) that maintains a replay buffer of historically difficult and high-quality samples for LLM post-training, a largely unexplored direction in RLVR for LLMs
  • Dynamic batch selection mechanism that re-evaluates stored experiences to identify and prioritize genuinely difficult samples, addressing reward homogeneity in on-policy RL
  • Theoretical lower bound guarantee on policy improvement during off-policy updates, ensuring training stability despite distributional shift between buffer and current policy

Evaluation Highlights

  • BAPO achieves an average 12.5% improvement over GRPO baseline across mathematics, planning, and visual reasoning benchmarks
  • BAPO resolves 40.7% of problems that base models consistently fail to solve across all rollouts, demonstrating strong capability on hard samples

Breakthrough Assessment

6/10. BAPO is a solid and practically motivated contribution that imports well-established off-policy RL ideas into the RLVR-for-LLMs pipeline with meaningful empirical gains. However, it is an incremental improvement over GRPO rather than a paradigm shift, and replay buffers are a long-known technique in RL.

Methodology

  1. Collect rollouts during on-policy episodes and store them in a structured replay buffer, tagging samples by difficulty based on reward signals and model performance history
  2. At each training step, dynamically compose a batch by mixing re-evaluated historically difficult samples and high-quality stored experiences, rather than using only freshly sampled on-policy data
  3. Apply off-policy policy gradient updates with importance sampling corrections and enforce a theoretical lower bound on policy improvement to maintain training stability and prevent policy collapse
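The three steps above can be sketched in a few lines. This is a hypothetical simplification, not the paper's implementation: `policy_logp`, the sample fields, and the clipping constant are all illustrative assumptions, and real RLVR training would operate on token-level log-probabilities from an LLM.

```python
import math

def bapo_step(policy_logp, buffer, fresh_rollouts, batch_size=4, clip=2.0):
    """One illustrative off-policy update step in the spirit of BAPO."""
    # 1. Store fresh rollouts alongside older experiences.
    buffer.extend(fresh_rollouts)
    # 2. Prioritize samples the policy still finds difficult (low reward).
    buffer.sort(key=lambda s: s["reward"])
    batch = buffer[:batch_size]
    # 3. Importance-weighted policy-gradient loss; clipping the ratio
    #    limits the influence of stale, far-off-policy samples.
    loss = 0.0
    for s in batch:
        ratio = math.exp(policy_logp(s["x"]) - s["logp_behavior"])
        loss -= min(ratio, clip) * s["advantage"]
    return loss / len(batch)
```

When a sample is on-policy, the importance ratio is 1 and the step reduces to a plain policy-gradient loss; the ratio and clip only matter for reused buffer samples.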

System Components

Replay Buffer

Stores historical rollout experiences annotated with difficulty and quality metadata, enabling experience reuse across training iterations
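A minimal sketch of such a buffer, assuming a simple schema: the field names (`solve_rate`, `reward`) and the FIFO eviction policy are assumptions for illustration, since the paper's actual data structure is not described in this summary.

```python
from dataclasses import dataclass

@dataclass
class Experience:
    prompt: str
    response: str
    reward: float
    solve_rate: float  # fraction of past rollouts that solved this prompt

class ReplayBuffer:
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.items = []

    def add(self, exp):
        self.items.append(exp)
        if len(self.items) > self.capacity:
            self.items.pop(0)  # evict the oldest experience first

    def hardest(self, k):
        """The k stored samples the model solves least often."""
        return sorted(self.items, key=lambda e: e.solve_rate)[:k]
```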

Dynamic Batch Selector

Re-evaluates stored samples against the current policy to identify which historical examples remain difficult or informative, and assembles training batches that prioritize hard problems
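One way such a selector could work, sketched under stated assumptions: `reeval` stands in for re-scoring a stored sample against the current policy (e.g., estimating its current solve rate from fresh rollouts), and the 50/50 mix and the (0, 0.5) difficulty band are illustrative choices, not the paper's.

```python
def compose_batch(fresh, buffer_items, reeval, batch_size=8, hard_frac=0.5):
    """Mix fresh on-policy samples with re-evaluated hard buffer samples."""
    n_hard = int(batch_size * hard_frac)
    # Re-score buffered samples under the *current* policy.
    scored = sorted(((reeval(s), s) for s in buffer_items), key=lambda t: t[0])
    # Keep samples that are still hard but not hopeless: occasionally
    # solved prompts carry the most learning signal.
    still_hard = [s for rate, s in scored if 0.0 < rate < 0.5][:n_hard]
    # Fill the remainder of the batch with fresh on-policy data.
    return still_hard + fresh[: batch_size - len(still_hard)]
```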

Policy Improvement Lower Bound

A theoretical constraint applied during off-policy updates that uses importance weighting to guarantee the policy does not degrade, compensating for distributional shift between buffer data and the current policy
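The paper's exact bound is not reproduced in this summary. Guarantees of this kind typically take the form of the conservative policy iteration / TRPO bound, where an importance-sampled surrogate objective minus a penalty on policy divergence lower-bounds true improvement:

$$
J(\pi_{\text{new}}) \;\ge\; L_{\pi_{\text{old}}}(\pi_{\text{new}}) \;-\; C \, D_{\mathrm{KL}}^{\max}(\pi_{\text{old}}, \pi_{\text{new}}),
$$

with the surrogate

$$
L_{\pi_{\text{old}}}(\pi) \;=\; J(\pi_{\text{old}}) \;+\; \mathbb{E}_{s \sim \rho_{\pi_{\text{old}}},\, a \sim \pi_{\text{old}}}\!\left[ \frac{\pi(a \mid s)}{\pi_{\text{old}}(a \mid s)} \, A_{\pi_{\text{old}}}(s, a) \right].
$$

Maximizing the right-hand side guarantees monotonic improvement as long as the new policy stays close to the behavior policy, which is the same role the importance-weighted constraint plays here for buffer data.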

Reward Homogeneity Mitigation

Batch composition strategy that actively counteracts the tendency of on-policy sampling to over-represent easy, high-reward samples by enforcing difficulty diversity in training batches
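Reward homogeneity is easy to quantify in a GRPO-style setup: when every rollout in a prompt's group receives the same reward, the group-normalized advantage is zero and the group contributes no gradient. A small illustrative diagnostic (not from the paper):

```python
def zero_signal_fraction(groups):
    """Fraction of prompt groups whose rollout rewards are all identical.

    In GRPO, such degenerate groups (all-correct or all-wrong) yield zero
    advantage and therefore no learning signal.
    """
    degenerate = sum(1 for g in groups if len(set(g)) == 1)
    return degenerate / len(groups)
```

A high value on this diagnostic would indicate that most of the batch is wasted, which is the failure mode BAPO's difficulty-aware batch composition targets.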

Results

Metric                                          GRPO (Baseline)   BAPO (This Paper)   Delta
Avg. across math, planning, visual reasoning    baseline          baseline +12.5%     +12.5%
Problems the base model fails on all rollouts   0% solved         40.7% solved        +40.7 pp

Note: per-domain numbers are not broken out in this summary; the 12.5% figure is an average across the three benchmark domains.

Key Takeaways

  • Maintaining a replay buffer of difficult and high-quality samples is a low-cost, high-impact modification to RLVR training pipelines that can significantly improve learning efficiency without changing the base model architecture
  • Reward homogeneity — where easy samples dominate on-policy batches — is a concrete, addressable bottleneck in LLM RL post-training; practitioners should audit their batch composition for difficulty diversity
  • Off-policy methods require importance sampling and stability guarantees to work reliably in LLM post-training; the 40.7% resolution rate on previously unsolvable problems suggests targeted hard-sample replay is especially valuable for expanding model capability boundaries

Abstract

Traditional on-policy Reinforcement Learning with Verifiable Rewards (RLVR) frameworks suffer from experience waste and reward homogeneity, which directly hinder learning efficiency on difficult samples during large language model post-training. In this paper, we introduce Batch Adaptation Policy Optimization (BAPO), an off-policy RLVR framework that improves data efficiency in large language model post-training. It dynamically selects training batches by re-evaluating historically difficult samples and reusing high-quality ones, while maintaining a lower-bound guarantee on policy improvement. Extensive experiments demonstrate that BAPO achieves an average 12.5% improvement over GRPO across mathematics, planning, and visual reasoning tasks. Crucially, BAPO successfully resolves 40.7% of problems that base models consistently fail to solve.

Generated on 2026-03-02 using Claude