
Monte Carlo Planning with Large Language Model for Text-Based Game Agents

Zijing Shi, Meng Fang, Ling Chen
International Conference on Learning Representations | 2025
MC-DML integrates Large Language Models with Monte Carlo Tree Search by equipping LLMs with dynamic in-trial and cross-trial memory mechanisms, enabling efficient language-grounded planning in text-based games without requiring multiple costly RL iterations.

Problem Statement

Traditional planning-then-learning approaches like MCTS+RL are computationally expensive due to extensive iterative rollouts. These methods rely on uncertainty-driven exploration but lack the semantic understanding needed to reason about natural language action spaces. There is a gap between the exploratory power of tree search and the language comprehension capabilities needed for text-based game environments.

Key Novelty

  • Introduction of MC-DML: a hybrid algorithm combining MCTS with LLM-guided action evaluation, replacing uncertainty-only heuristics with language-aware reasoning
  • In-trial memory mechanism that allows the LLM to accumulate and leverage observations within a single planning episode for dynamic action re-evaluation
  • Cross-trial memory mechanism that enables the agent to learn from past game episodes and transfer experiential knowledge to improve future planning iterations

Evaluation Highlights

  • MC-DML significantly outperforms strong contemporary MCTS+RL baselines on the Jericho benchmark suite across multiple text-based games at the initial planning phase
  • The algorithm achieves competitive or superior performance without requiring multiple costly RL training iterations, demonstrating greater sample and time efficiency

Breakthrough Assessment

6/10 MC-DML is a solid and practically relevant contribution that meaningfully bridges LLM reasoning with tree search planning, demonstrating efficiency gains over RL-heavy baselines; however, it is an incremental integration of existing techniques (MCTS + LLMs + memory) rather than a fundamental algorithmic or theoretical breakthrough.

Methodology

  1. Construct a Monte Carlo Tree Search planning framework where the LLM serves as the policy/value estimator, replacing or augmenting traditional neural network rollout policies with language-grounded action scoring
  2. Equip the LLM with an in-trial memory module that records observations, actions, and outcomes within the current game episode, enabling the model to dynamically refine action evaluations as new information is gathered
  3. Implement a cross-trial memory module that persists and retrieves relevant experiences across game episodes, allowing the agent to ground its reasoning in historical successes and failures during the MCTS selection and expansion phases
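The three steps above can be sketched as a single MC-DML planning iteration. This is a minimal illustration, not the paper's implementation: `llm_priors` stands in for the actual LLM prompt (here it just returns a uniform distribution), and the class and function names are my own.

```python
import math

class Node:
    """One search-tree node: a game state with its candidate text actions."""
    def __init__(self, obs, actions):
        self.obs = obs
        self.actions = list(actions)
        self.visits = {a: 0 for a in actions}      # N(s, a)
        self.q_values = {a: 0.0 for a in actions}  # running mean return
        self.priors = {}                           # LLM prior p(a | s, memory)
        self.children = {}                         # action -> child Node

def llm_priors(obs, actions, in_trial_mem, cross_trial_mem):
    """Placeholder for the LLM call that scores candidate actions given the
    observation plus both memories. A real system would prompt an LLM;
    here we return a uniform distribution so the sketch runs standalone."""
    p = 1.0 / len(actions)
    return {a: p for a in actions}

def select(node, c_puct=1.5):
    """PUCT-style selection: exploit Q estimates, explore via LLM priors."""
    n = sum(node.visits.values())
    def score(a):
        u = c_puct * node.priors[a] * math.sqrt(n + 1) / (1 + node.visits[a])
        return node.q_values[a] + u
    return max(node.actions, key=score)

def backpropagate(path, ret):
    """Update visit counts and running-mean Q along the visited path."""
    for node, action in reversed(path):
        node.visits[action] += 1
        node.q_values[action] += (ret - node.q_values[action]) / node.visits[action]

# One planning iteration: score actions with the LLM, select, backpropagate.
root = Node("You are in a dark room.", ["turn on lamp", "go north"])
root.priors = llm_priors(root.obs, root.actions, in_trial_mem=[], cross_trial_mem=[])
chosen = select(root)
backpropagate([(root, chosen)], ret=1.0)   # pretend the rollout returned 1.0
```

In a full loop, `select` would descend until an unexpanded node, the environment would be stepped to simulate a rollout, and the return would be backpropagated along the whole path.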

System Components

Monte Carlo Tree Search (MCTS) Backbone

Provides the structured exploratory planning framework, managing the selection, expansion, simulation, and backpropagation phases of the search tree

LLM Policy/Value Estimator

Replaces traditional heuristic or learned neural network evaluators with an LLM that scores candidate actions using natural language understanding and commonsense reasoning
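One plausible form of this integration, sketched here as an assumption (the paper's exact selection rule may differ), is a PUCT-style rule where the usual prior is replaced by an LLM probability conditioned on the state and the memory context $M$:

```latex
a^{*} \;=\; \arg\max_{a}\;\Big[\, Q(s,a)
  \;+\; c_{\mathrm{puct}} \cdot p_{\mathrm{LLM}}(a \mid s, M)\,
  \frac{\sqrt{N(s)}}{1 + N(s,a)} \,\Big]
```

Here $Q(s,a)$ is the mean return from simulations, $N(s)$ and $N(s,a)$ are visit counts, and $c_{\mathrm{puct}}$ trades off exploitation against LLM-guided exploration.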

In-Trial Memory Module

Stores observations, actions, and rewards within the current planning episode, allowing the LLM to dynamically update its action evaluations based on accumulating context
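A minimal sketch of such a module, assuming a simple bounded log of steps rendered into prompt text (the class name, cap, and rendering format are illustrative, not from the paper):

```python
class InTrialMemory:
    """Accumulates (observation, action, reward) steps within one episode
    and renders them as prompt context for the LLM's next evaluation call."""
    def __init__(self, max_steps=20):
        self.max_steps = max_steps   # cap to keep the prompt within context limits
        self.steps = []

    def record(self, obs, action, reward):
        self.steps.append((obs, action, reward))

    def as_prompt(self):
        """Render only the most recent steps, one line per step."""
        recent = self.steps[-self.max_steps:]
        lines = [f"Obs: {o} | Action: {a} | Reward: {r}" for o, a, r in recent]
        return "\n".join(lines)
```

Each new call to the LLM would then include `as_prompt()` output, so action evaluations shift as the episode's context accumulates.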

Cross-Trial Memory Module

Maintains a persistent memory store of experiences across multiple game episodes, retrieved at planning time to inform the LLM with relevant historical knowledge
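As a hedged sketch, cross-trial retrieval might rank stored episode summaries by score and surface the top few at planning time; the paper's actual storage and retrieval scheme may differ, and the names below are illustrative:

```python
class CrossTrialMemory:
    """Persists per-episode summaries (e.g. reflections on successes and
    failures) across trials and retrieves the best-scoring ones at planning time."""
    def __init__(self, top_k=3):
        self.top_k = top_k
        self.entries = []   # list of (episode_score, summary) pairs

    def add(self, episode_score, summary):
        self.entries.append((episode_score, summary))

    def retrieve(self):
        """Return the top-k summaries by episode score, best first."""
        ranked = sorted(self.entries, key=lambda e: e[0], reverse=True)
        return [summary for _, summary in ranked[: self.top_k]]
```

The retrieved summaries would be prepended to the LLM's prompt during selection and expansion, grounding its action scores in prior trials.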

Jericho Benchmark Evaluation Suite

A collection of diverse text-based games used to evaluate agent performance across varied interactive fiction environments

Results

| Aspect | MCTS+RL Baseline | MC-DML (Ours) | Delta |
|---|---|---|---|
| Jericho (avg. across games) | Strong contemporary multi-iteration baseline | Outperforms at initial planning phase | Significant improvement without multi-iteration RL |
| Planning efficiency | Requires extensive RL iterations | Single/few planning phases sufficient | Substantially reduced compute cost |
| Language reasoning | Uncertainty-driven only, no semantic understanding | LLM-guided semantic action evaluation | Qualitative leap in action scoring quality |

Key Takeaways

  • LLMs can serve as effective plug-in policy and value estimators within classical tree search frameworks, reducing reliance on expensive RL training loops while improving semantic action reasoning
  • Memory mechanisms (both within-episode and across-episodes) are critical for LLM agents to improve over time in interactive environments — stateless LLM calls are insufficient for complex sequential decision-making
  • For practitioners building game or simulation agents, the MC-DML pattern offers a practical template: use MCTS for structured exploration, LLMs for language-grounded evaluation, and episodic memory for continual improvement without full retraining

Abstract

Text-based games provide valuable environments for language-based autonomous agents. However, planning-then-learning paradigms, such as those combining Monte Carlo Tree Search (MCTS) and reinforcement learning (RL), are notably time-consuming due to extensive iterations. Additionally, these algorithms perform uncertainty-driven exploration but lack language understanding and reasoning abilities. In this paper, we introduce the Monte Carlo planning with Dynamic Memory-guided Large language model (MC-DML) algorithm. MC-DML leverages the language understanding and reasoning capabilities of Large Language Models (LLMs) alongside the exploratory advantages of tree search algorithms. Specifically, we enhance LLMs with in-trial and cross-trial memory mechanisms, enabling them to learn from past experiences and dynamically adjust action evaluations during planning. We conduct experiments on a series of text-based games from the Jericho benchmark. Our results demonstrate that the MC-DML algorithm significantly enhances performance across various games at the initial planning phase, outperforming strong contemporary methods that require multiple iterations. This demonstrates the effectiveness of our algorithm, paving the way for more efficient language-grounded planning in complex environments.

Generated on 2026-03-03 using Claude