PlanU: Large Language Model Reasoning through Planning under Uncertainty
Problem Statement
LLMs struggle with reasoning in stochastic environments where state transitions are uncertain, not just where LLM outputs vary. Existing LLM decision-making approaches address LLM uncertainty via multiple reasoning chains or search trees but ignore environmental uncertainty, leading to poor performance in truly stochastic settings. Methods that forecast unknown variable probabilities exist but are not designed for interactive multi-step planning tasks.
Key Novelty
- Modeling MCTS node returns as quantile distributions rather than point estimates to explicitly represent return uncertainty in stochastic environments
- Introduction of the Upper Confidence Bounds with Curiosity (UCC) score that balances exploration and exploitation by estimating node uncertainty during tree search
- A unified framework (PlanU) that simultaneously addresses both LLM uncertainty and environmental uncertainty within a single MCTS-based planning pipeline
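A toy numeric illustration of the first point (all return values are invented for this sketch): two actions can share the same mean return while having very different return distributions, so a point estimate cannot distinguish them, whereas even coarse quantile information can.

```python
import statistics

# Invented rollout returns for two hypothetical actions in a stochastic environment.
safe_returns = [0.9, 1.0, 1.1, 1.0]    # low-variance outcomes
risky_returns = [0.0, 2.0, 0.0, 2.0]   # coin-flip outcomes

# A point estimate (the mean) is essentially identical for both actions...
mean_safe = statistics.mean(safe_returns)     # ~1.0
mean_risky = statistics.mean(risky_returns)   # ~1.0

# ...but even the crudest quantile summary (min/max spread) exposes the difference.
spread_safe = max(safe_returns) - min(safe_returns)      # ~0.2
spread_risky = max(risky_returns) - min(risky_returns)   # 2.0
```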
Evaluation Highlights
- Extensive experiments on LLM-based reasoning tasks under uncertainty demonstrate that PlanU outperforms LLM-based Decision-Making (LDM) baselines that ignore environmental stochasticity
- PlanU shows improved performance over methods using multiple reasoning chains/search trees and probability-forecasting approaches on stochastic multi-step planning benchmarks
Methodology
- Frame LLM-based planning as a tree search problem where the LLM proposes actions and evaluates states, then execute MCTS to explore the planning space across multiple rollouts
- Model the return of each MCTS node as a quantile distribution (a set of quantiles representing the full return distribution) rather than a scalar, capturing environmental stochasticity in state transitions
- Use the Upper Confidence Bounds with Curiosity (UCC) score—derived from the quantile distribution's uncertainty—to guide node selection during tree search, balancing exploitation of high-return nodes with exploration of uncertain ones
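The three steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class and function names, the number of quantiles, and the bonus coefficients (`c_explore`, `c_curiosity`) are all assumptions, and the curiosity term is approximated here by the spread of the quantile estimates.

```python
import math

class Node:
    """One state in the search tree; `quantiles` approximates the node's return distribution."""
    def __init__(self, n_quantiles=8):
        self.quantiles = [0.0] * n_quantiles  # equally weighted quantile estimates
        self.visits = 0
        self.children = []

    def mean_return(self):
        # Mean of equally weighted quantiles approximates the expected return.
        return sum(self.quantiles) / len(self.quantiles)

    def spread(self):
        # Spread of the quantile estimates as a simple uncertainty proxy.
        return max(self.quantiles) - min(self.quantiles)

def ucc_score(parent, child, c_explore=1.4, c_curiosity=0.5):
    """Hypothetical UCC-style score: mean return + UCB visit bonus + curiosity bonus."""
    if child.visits == 0:
        return float("inf")  # always try unvisited children first
    visit_bonus = c_explore * math.sqrt(math.log(parent.visits) / child.visits)
    return child.mean_return() + visit_bonus + c_curiosity * child.spread()

def select_child(parent):
    """Selection step of the tree search: pick the child maximizing the UCC score."""
    return max(parent.children, key=lambda ch: ucc_score(parent, ch))
```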
System Components
- MCTS backbone: Standard Monte Carlo Tree Search adapted for LLM-driven action proposal and state evaluation, enabling multi-step lookahead in stochastic environments
- Quantile return distribution: Represents the return at each MCTS node as a set of quantiles rather than a point estimate, explicitly capturing the uncertainty due to stochastic environment transitions
- UCC score: A node selection criterion that combines the estimated return with a curiosity/uncertainty bonus derived from the spread of the quantile distribution, enabling principled exploration of uncertain nodes
- LLM: The underlying large language model that generates candidate actions, evaluates partial plans, and provides heuristic guidance within the MCTS search tree
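One standard way to maintain such a quantile set online, sketched below with a quantile-regression-style update (the pinball-loss subgradient); the paper's exact update rule may differ, and the learning rate and quantile levels here are assumptions.

```python
def update_quantiles(quantiles, observed_return, lr=0.1):
    """Nudge each quantile estimate toward an observed rollout return using the
    subgradient of the pinball (quantile-regression) loss."""
    n = len(quantiles)
    updated = []
    for i, q in enumerate(quantiles):
        tau = (i + 0.5) / n  # midpoint quantile level in (0, 1)
        # Pinball subgradient: move up by tau if the observation exceeds the
        # estimate, otherwise move down by (1 - tau).
        step = tau if observed_return > q else tau - 1.0
        updated.append(q + lr * step)
    return updated
```

Repeated over many rollouts, higher quantile levels settle above lower ones, so the list of estimates approximates the node's return distribution.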
Results
| Benchmark/Setting | Best Baseline (MCTS/Chain-of-Thought) | PlanU | Delta |
|---|---|---|---|
| Stochastic planning tasks (multi-step) | Lower success rate due to ignored env. uncertainty | Higher success rate with quantile MCTS | Positive improvement |
| Reasoning under LLM uncertainty | Competitive (multiple chains/trees) | Comparable or better with UCC exploration | Marginal to moderate gain |
| Probability forecasting tasks | Targeted methods (not multi-step) | Handles both uncertainty types jointly | New capability unlocked |
Key Takeaways
- When deploying LLMs for planning in real-world stochastic environments (e.g., robotics, games, multi-step tool use), modeling return distributions rather than point estimates in the search tree is critical for robust performance
- The UCC score provides a practical drop-in replacement for standard UCB in MCTS when the environment has stochastic transitions, and can be adapted to other LLM-agent architectures using tree-based search
- Practitioners should distinguish between LLM uncertainty (output stochasticity) and environmental uncertainty (world stochasticity) when designing LLM reasoning systems—conflating the two leads to suboptimal decision-making strategies
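The "drop-in replacement" point can be made concrete: keeping the UCB1 term intact and adding a curiosity bonus. Both functions below are illustrative sketches; the coefficient names and the exact form of the curiosity term are assumptions, not the paper's formula.

```python
import math

def ucb1(mean_return, parent_visits, child_visits, c=1.4):
    """Classic UCB1 score used for node selection in vanilla MCTS."""
    return mean_return + c * math.sqrt(math.log(parent_visits) / child_visits)

def ucc(mean_return, parent_visits, child_visits, quantile_spread, c=1.4, beta=0.5):
    """UCC-style score: UCB1 plus a curiosity bonus proportional to the spread of
    the node's quantile return distribution (zero spread recovers plain UCB1)."""
    return ucb1(mean_return, parent_visits, child_visits, c) + beta * quantile_spread
```

Because the curiosity term vanishes for nodes with no return-distribution spread, swapping `ucb1` for `ucc` in an existing tree-search loop changes behavior only where the environment actually exhibits stochastic returns.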
Abstract
Large Language Models (LLMs) are increasingly being explored across a range of reasoning tasks. However, LLMs sometimes struggle with reasoning tasks under uncertainty that are relatively easy for humans, such as planning actions in stochastic environments. The adoption of LLMs for reasoning is impeded by uncertainty challenges, such as LLM uncertainty and environmental uncertainty. LLM uncertainty arises from the stochastic sampling process inherent to LLMs. Most LLM-based Decision-Making (LDM) approaches address LLM uncertainty through multiple reasoning chains or search trees. However, these approaches overlook environmental uncertainty, which leads to poor performance in environments with stochastic state transitions. Some recent LDM approaches deal with uncertainty by forecasting the probability of unknown variables. However, they are not designed for multi-step reasoning tasks that require interaction with the environment. To address uncertainty in LLM decision-making, we introduce PlanU, an LLM-based planning method that captures uncertainty within Monte Carlo Tree Search (MCTS). PlanU models the return of each node in the MCTS as a quantile distribution, which uses a set of quantiles to represent the return distribution. To balance exploration and exploitation during tree search, PlanU introduces an Upper Confidence Bounds with Curiosity (UCC) score which estimates the uncertainty of MCTS nodes. Through extensive experiments, we demonstrate the effectiveness of PlanU in LLM-based reasoning tasks under uncertainty.