PlanU: Large Language Model Reasoning through Planning under Uncertainty

PlanU is an LLM-based planning framework that integrates Monte Carlo Tree Search with quantile-based return distributions and an uncertainty-aware exploration score to handle both LLM stochasticity and environmental uncertainty in multi-step reasoning tasks.

Problem Statement

LLMs struggle with reasoning in stochastic environments, where uncertainty arises from state transitions as well as from variation in LLM outputs. Existing LLM decision-making approaches address LLM uncertainty with multiple reasoning chains or search trees but ignore environmental uncertainty, which leads to poor performance when state transitions are stochastic. Methods that forecast the probabilities of unknown variables exist, but they are not designed for interactive multi-step planning.

Key Novelty

Modeling MCTS node returns as quantile distributions rather than point estimates to explicitly represent return uncertainty in stochastic environments
Introduction of the Upper Confidence Bounds with Curiosity (UCC) score that balances exploration and exploitation by estimating node uncertainty during tree search
A unified framework (PlanU) that simultaneously addresses both LLM uncertainty and environmental uncertainty within a single MCTS-based planning pipeline

Evaluation Highlights

Experiments on LLM-based reasoning tasks under uncertainty show that PlanU outperforms baseline LLM-based decision-making (LDM) approaches that ignore environmental stochasticity
PlanU shows improved performance over methods using multiple reasoning chains/search trees and probability-forecasting approaches on stochastic multi-step planning benchmarks

Signal Assessment

6/10. PlanU makes a solid contribution by combining quantile distributional MCTS with LLM planning in a principled way, addressing a genuine gap in handling environmental uncertainty. However, it is an incremental integration of existing techniques (MCTS, distributional RL, UCB) rather than a paradigm-shifting approach.

Methodology

Frame LLM-based planning as a tree search problem where the LLM proposes actions and evaluates states, then execute MCTS to explore the planning space across multiple rollouts
Model the return of each MCTS node as a quantile distribution (a set of quantiles representing the full return distribution) rather than a scalar, capturing environmental stochasticity in state transitions
Use the Upper Confidence Bounds with Curiosity (UCC) score—derived from the quantile distribution's uncertainty—to guide node selection during tree search, balancing exploitation of high-return nodes with exploration of uncertain ones
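
The three steps above can be sketched end to end on a toy problem. This is a minimal illustration under stated assumptions, not PlanU's implementation: `propose_actions` is a stub standing in for the LLM, `step` is a made-up environment, and the selection `score` (mean plus a visit-count bonus plus a quantile-spread bonus) is an assumed form, since this summary does not give the actual UCC formula.

```python
import math
import random
import statistics

ACTIONS = ["safe", "risky"]  # hypothetical stand-in for LLM-proposed actions

def propose_actions(state):
    """Stub for the LLM action proposer; a real system would prompt a model."""
    return ACTIONS

def step(state, action, rng):
    """Toy environment: the reward is random; the next state is kept
    deterministic so the tree can be keyed by action only (a simplification)."""
    if action == "safe":
        return state + 1, 0.5
    # "risky" pays 1.5 with probability 0.5, otherwise 0.0.
    return state + 1, (1.5 if rng.random() < 0.5 else 0.0)

class Node:
    def __init__(self):
        self.children = {}  # action -> Node
        self.returns = []   # sampled returns observed through this node

    def quantiles(self, n=4):
        # Empirical quantile cut points; statistics.quantiles needs >= 2 samples.
        if len(self.returns) < 2:
            return list(self.returns)
        return statistics.quantiles(self.returns, n=n)

def score(parent, child, c=1.0, beta=0.5):
    """Assumed selection score: mean + count bonus + quantile spread."""
    if not child.returns:
        return math.inf  # visit every child at least once
    q = child.quantiles()
    spread = max(q) - min(q)
    bonus = c * math.sqrt(math.log(len(parent.returns) + 1) / len(child.returns))
    return statistics.fmean(child.returns) + bonus + beta * spread

def simulate(node, state, depth, rng):
    """One rollout: expand, select by score, then back up the sampled return."""
    if depth == 0:
        return 0.0
    for a in propose_actions(state):
        node.children.setdefault(a, Node())
    action = max(node.children, key=lambda a: score(node, node.children[a]))
    next_state, reward = step(state, action, rng)
    ret = reward + simulate(node.children[action], next_state, depth - 1, rng)
    node.children[action].returns.append(ret)
    return ret

def plan(root_state=0, n_rollouts=300, depth=2, seed=0):
    rng = random.Random(seed)
    root = Node()
    for _ in range(n_rollouts):
        root.returns.append(simulate(root, root_state, depth, rng))
    # Recommend the most-visited root action, a common MCTS convention.
    best = max(root.children, key=lambda a: len(root.children[a].returns))
    return best, root
```

Calling `plan()` returns the recommended first action together with the root node, whose children hold the sampled returns that the quantile estimates are computed from.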

System Components

MCTS Backbone

Standard Monte Carlo Tree Search framework adapted for LLM-driven action proposal and state evaluation, enabling multi-step lookahead in stochastic environments

Quantile Return Distribution

Represents the return at each MCTS node as a set of quantiles rather than a point estimate, explicitly capturing the uncertainty due to stochastic environment transitions
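
As a rough illustration of what a set of quantiles per node can look like in code, the sketch below keeps N quantile estimates and nudges them with a quantile-regression (pinball-loss) step, as is common in distributional RL. The update rule, the class name `QuantileNode`, the step size `lr`, and the choice of five quantiles are all assumptions made for this example; the summary does not state how PlanU updates its quantiles.

```python
import random

class QuantileNode:
    """Hypothetical node statistic: N quantile estimates of the return."""

    def __init__(self, n_quantiles=5, init=0.0):
        self.n = n_quantiles
        # Midpoint quantile levels tau_i = (2i + 1) / (2N).
        self.taus = [(2 * i + 1) / (2 * n_quantiles) for i in range(n_quantiles)]
        self.q = [init] * n_quantiles
        self.visits = 0

    def update(self, sampled_return, lr=0.05):
        """One stochastic subgradient step on the pinball (quantile) loss."""
        for i, tau in enumerate(self.taus):
            # Move up with weight tau if the sample lies above the estimate,
            # otherwise move down with weight (1 - tau).
            if sampled_return > self.q[i]:
                self.q[i] += lr * tau
            else:
                self.q[i] -= lr * (1.0 - tau)
        self.visits += 1

    def mean(self):
        """Average of the quantile estimates, a proxy for the expected return."""
        return sum(self.q) / self.n

    def spread(self):
        """Gap between the highest and lowest tracked quantile."""
        return max(self.q) - min(self.q)


rng = random.Random(0)
node = QuantileNode()
for _ in range(5000):
    # Toy stochastic return: 1.0 with probability 0.5, else 0.0.
    node.update(1.0 if rng.random() < 0.5 else 0.0)
```

For this two-outcome return, the low quantiles settle near 0 and the high quantiles near 1, so `node.spread()` stays large even though `node.mean()` sits near 0.5. A scalar estimate would keep only the second number.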

Upper Confidence Bounds with Curiosity (UCC) Score

A node selection criterion that combines the estimated return with a curiosity/uncertainty bonus derived from the spread of the quantile distribution, enabling principled exploration of uncertain nodes
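
The description above suggests a score with three ingredients: an estimated return, a UCB-style visit-count bonus, and a bonus that grows with the spread of the node's quantiles. The sketch below is one plausible form under those assumptions. The exact UCC formula is not given in this summary, so the function name `ucc_score`, the weights `c` and `beta`, and the use of max-minus-min as the spread measure are illustrative choices rather than the paper's definition.

```python
import math

def ucc_score(quantiles, child_visits, parent_visits, c=1.0, beta=0.5):
    """Assumed UCC-style score: mean + UCB count bonus + quantile-spread bonus."""
    if child_visits == 0:
        return math.inf  # always try unvisited children first
    mean = sum(quantiles) / len(quantiles)
    count_bonus = c * math.sqrt(math.log(parent_visits) / child_visits)
    curiosity = beta * (max(quantiles) - min(quantiles))
    return mean + count_bonus + curiosity

# Two children with the same mean return and visit count but different spread.
safe = ucc_score([0.4, 0.5, 0.6], child_visits=10, parent_visits=20)
risky = ucc_score([0.0, 0.5, 1.0], child_visits=10, parent_visits=20)
```

Under this assumed form, `risky` scores higher than `safe` purely because its quantiles are more spread out, which is the behavior a plain UCB score over point estimates cannot express.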

LLM Reasoning Engine

The underlying large language model that generates candidate actions, evaluates partial plans, and provides heuristic guidance within the MCTS search tree

Results

Benchmark/Setting	Best Baseline (MCTS/Chain-of-Thought)	PlanU	Delta
Stochastic planning tasks (multi-step)	Lower success rate due to ignored environmental uncertainty	Higher success rate with quantile MCTS	Positive improvement
Reasoning under LLM uncertainty	Competitive (multiple chains/trees)	Comparable or better with UCC exploration	Marginal to moderate gain
Probability forecasting tasks	Targeted methods (not multi-step)	Handles both uncertainty types jointly	New capability unlocked

Key Takeaways

When deploying LLMs for planning in real-world stochastic environments (e.g., robotics, games, multi-step tool use), modeling return distributions rather than point estimates in the search tree is critical for robust performance
The UCC score provides a practical drop-in replacement for standard UCB in MCTS when the environment has stochastic transitions, and can be adapted to other LLM-agent architectures using tree-based search
Practitioners should distinguish between LLM uncertainty (output stochasticity) and environmental uncertainty (world stochasticity) when designing LLM reasoning systems—conflating the two leads to suboptimal decision-making strategies
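
One way to make the distinction in the last takeaway concrete, offered as an illustration rather than as the paper's formulation, is the law of total variance. The symbols are introduced here for the example only: G is the return of an episode and \pi is the plan sampled from the LLM.

```latex
\operatorname{Var}(G)
  = \underbrace{\mathbb{E}_{\pi}\big[\operatorname{Var}(G \mid \pi)\big]}_{\text{environmental uncertainty}}
  + \underbrace{\operatorname{Var}_{\pi}\big(\mathbb{E}[G \mid \pi]\big)}_{\text{LLM uncertainty}}
```

Sampling more reasoning chains or growing more search trees probes only the second term. The first term remains even when the plan is held fixed, which is why it has to be modeled separately.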

Abstract

Large Language Models (LLMs) are increasingly being explored across a range of reasoning tasks. However, LLMs sometimes struggle with reasoning tasks under uncertainty that are relatively easy for humans, such as planning actions in stochastic environments. The adoption of LLMs for reasoning is impeded by uncertainty challenges, such as LLM uncertainty and environmental uncertainty. LLM uncertainty arises from the stochastic sampling process inherent to LLMs. Most LLM-based Decision-Making (LDM) approaches address LLM uncertainty through multiple reasoning chains or search trees. However, these approaches overlook environmental uncertainty, which leads to poor performance in environments with stochastic state transitions. Some recent LDM approaches deal with uncertainty by forecasting the probability of unknown variables. However, they are not designed for multi-step reasoning tasks that require interaction with the environment. To address uncertainty in LLM decision-making, we introduce PlanU, an LLM-based planning method that captures uncertainty within Monte Carlo Tree Search (MCTS). PlanU models the return of each node in the MCTS as a quantile distribution, which uses a set of quantiles to represent the return distribution. To balance exploration and exploitation during tree search, PlanU introduces an Upper Confidence Bounds with Curiosity (UCC) score which estimates the uncertainty of MCTS nodes. Through extensive experiments, we demonstrate the effectiveness of PlanU in LLM-based reasoning tasks under uncertainty.