
Reinforcement World Model Learning for LLM-based Agents

Xiao Yu, Baolin Peng, Ruize Xu, Yelong Shen, Pengcheng He, Suman Nath, Nikhil Singh, Jian Gao, Zhou Yu
2026
RWML (Reinforcement World Model Learning) trains LLM-based agents to predict action consequences using sim-to-real gap rewards in a pre-trained embedding space, enabling self-supervised world model learning without expert demonstrations.

Problem Statement

LLMs excel at language tasks but often fail as agents because they cannot reliably anticipate the consequences of actions or adapt to environment dynamics. Existing approaches fall short: next-token prediction objectives prioritize lexical fidelity over semantic equivalence, and LLM-as-a-judge reward signals are prone to reward hacking. Robust, self-supervised methods for grounding LLM agents in environment dynamics without costly expert data are lacking.

Key Novelty

  • Sim-to-real gap reward signal that measures semantic alignment between predicted next states and actual observed next states in a pre-trained embedding space, avoiding token-level fidelity issues
  • Fully self-supervised world model learning framework (RWML) requiring no expert demonstrations, yet competitive with expert-data training when combined with task-success rewards
  • Demonstrated robustness advantages over both next-state token prediction (avoids model collapse) and LLM-as-a-judge (less susceptible to reward hacking)

Evaluation Highlights

  • RWML combined with task-success rewards outperforms direct task-success RL by 6.9 points on ALFWorld and 5.7 points on τ²Bench
  • RWML matches the performance of expert-data supervised training while being entirely self-supervised, validated on two diverse agentic benchmarks (ALFWorld and τ²Bench)

Breakthrough Assessment

7/10. RWML presents a significant advance by introducing a principled, self-supervised world-modeling signal for LLM agents that addresses key failure modes of prior approaches (token-level collapse, reward hacking) and closes the gap with expert-data methods without labeled demonstrations.

Methodology

  1. The LLM agent generates action-conditioned next-state predictions (simulated states) given current textual observations and candidate actions, functioning as a world model
  2. Simulated next states and actual observed next states from environment interaction are both encoded in a pre-trained embedding space, and a sim-to-real gap reward is computed as their semantic similarity
  3. The world model is optimized via reinforcement learning using the sim-to-real gap reward (self-supervised stage), and optionally combined with task-success rewards for further performance gains
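The core of steps 2–3 is a scalar reward computed from the similarity of two text embeddings. The sketch below illustrates the mechanism with a toy bag-of-words embedding; in RWML the encoder is a pre-trained semantic embedding model, which would also score paraphrases that share no surface words (the `embed` function here is purely a placeholder):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding" used only for illustration.
    # RWML uses a pre-trained semantic embedding model instead.
    return Counter(text.lower().split())

def sim_to_real_reward(predicted: str, observed: str) -> float:
    # Cosine similarity between the simulated next state and the
    # realized next state; this scalar is the RL training signal.
    p, o = embed(predicted), embed(observed)
    dot = sum(p[w] * o[w] for w in p)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in o.values())))
    return dot / norm if norm else 0.0

# A near-paraphrase scores higher than an unrelated prediction.
r_close = sim_to_real_reward("the drawer is now open",
                             "the drawer is open now")
r_far = sim_to_real_reward("you pick up the apple",
                           "the drawer is open now")
```

Because the reward lives in embedding space rather than token space, the model is not pushed to reproduce exact wording, which is the failure mode of next-state token prediction.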

System Components

World Model (LLM)

The LLM parameterized to predict action-conditioned next textual states given current state and action, serving as an internal simulator of environment dynamics
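
A minimal sketch of how such a world-model query could be framed as a prompt. The template below is hypothetical (the paper's actual prompt format is not given in this summary); it only shows the action-conditioned structure: current state plus candidate action in, predicted next observation out.

```python
def world_model_prompt(state: str, action: str) -> str:
    # Hypothetical prompt template for action-conditioned
    # next-state prediction; not the paper's exact format.
    return (
        "You are simulating a text-based environment.\n"
        f"Current state: {state}\n"
        f"Action taken: {action}\n"
        "Predict the next observation:"
    )

prompt = world_model_prompt(
    "You are in the kitchen. A drawer is closed.", "open drawer"
)
```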

Sim-to-Real Gap Reward

A semantic similarity score computed between the model's predicted next state and the actual observed next state in a pre-trained embedding space, used as the RL training signal

Pre-trained Embedding Space

An external embedding model used to encode both simulated and real next states, enabling semantic comparison that is robust to surface-level wording differences

Task-Success Reward Combination

Optional integration of environment task-success signals alongside the sim-to-real reward to jointly optimize world modeling and task performance
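
One simple way to combine the two signals is a weighted sum, sketched below. The weighting scheme and coefficient `alpha` are assumptions for illustration; the summary does not specify how the paper mixes the two rewards.

```python
def combined_reward(sim_to_real: float, task_success: float,
                    alpha: float = 0.5) -> float:
    # Hypothetical linear mixing of the dense sim-to-real reward
    # with the sparse task-success reward; alpha is illustrative.
    return alpha * sim_to_real + (1.0 - alpha) * task_success
```

The dense sim-to-real term provides a learning signal on every step, while the sparse task-success term anchors optimization to the actual objective.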

Results

| Benchmark | Direct Task-Success RL | RWML + Task-Success RL | Delta |
|---|---|---|---|
| ALFWorld | Baseline RL score | Baseline + 6.9 pts | +6.9 points |
| τ²Bench | Baseline RL score | Baseline + 5.7 pts | +5.7 points |
| ALFWorld (self-supervised only) | Base LLM | Significant gain | Positive vs. base |
| Expert-data training match | Expert supervised | Matched by RWML + RL | ≈ parity |

Key Takeaways

  • Using semantic embedding-space rewards instead of token-level prediction objectives is a more robust training signal for world models in text-based agentic environments, preventing model collapse
  • Self-supervised world model pre-training via sim-to-real gap rewards can substitute for expensive expert demonstrations and serves as an effective auxiliary reward when combined with sparse task-success signals
  • Practitioners building LLM agents for interactive environments should consider world model auxiliary objectives as a low-cost way to improve sample efficiency and performance of RL fine-tuning without labeled data

Abstract

Large language models (LLMs) have achieved strong performance in language-centric tasks. However, in agentic settings, LLMs often struggle to anticipate action consequences and adapt to environment dynamics, highlighting the need for world-modeling capabilities in LLM-based agents. We propose Reinforcement World Model Learning (RWML), a self-supervised method that learns action-conditioned world models for LLM-based agents on textual states using sim-to-real gap rewards. Our method aligns simulated next states produced by the model with realized next states observed from the environment, encouraging consistency between internal world simulations and actual environment dynamics in a pre-trained embedding space. Unlike next-state token prediction, which prioritizes token-level fidelity (i.e., reproducing exact wording) over semantic equivalence and can lead to model collapse, our method provides a more robust training signal and is empirically less susceptible to reward hacking than LLM-as-a-judge. We evaluate our method on ALFWorld and τ²Bench and observe significant gains over the base model, despite being entirely self-supervised. When combined with task-success rewards, our method outperforms direct task-success reward RL by 6.9 and 5.7 points on ALFWorld and τ²Bench respectively, while matching the performance of expert-data training.

Generated on 2026-03-02 using Claude