Recursive Language Models Meet Uncertainty: The Surprising Effectiveness of Self-Reflective Program Search for Long Context
Problem Statement
Long-context handling is a persistent challenge for LLMs even when context windows are nominally large enough: models struggle to reliably extract and reason over extended inputs. Existing agentic approaches such as Recursive Language Models (RLM) decompose contexts via programmatic sub-calls but leave the critical question of program selection largely unaddressed. Poor program selection limits RLM's effectiveness, especially on semantically intensive tasks and on contexts that fit within the model's native window, where recursion can actually degrade performance.
Key Novelty
- Introduction of SRLM, an uncertainty-aware self-reflective program search framework that uses three complementary intrinsic uncertainty signals (self-consistency, reasoning length, verbalized confidence) to evaluate and select context-interaction programs
- Empirical finding that recursion is not the primary driver of RLM performance—simple self-reflective program search without recursion or self-query can match or surpass full RLM, challenging the core assumption of the RLM paradigm
- Demonstration that self-reflection provides a semantic signal that improves performance on semantically intensive tasks where heuristic program search fails, and yields consistent gains across both short and long contexts unlike RLM
Evaluation Highlights
- SRLM yields up to 22% improvement over RLM under the same time/compute budget across diverse benchmarks, context lengths, and backbone models
- SRLM without explicit recursion matches or surpasses full RLM, while RLM with recursion often degrades performance relative to the base model for contexts within the model's context window
Methodology
- Generate multiple candidate context-interaction programs for a given long-context task using the backbone LLM, analogous to the programmatic decomposition step in RLM
- Evaluate each candidate program using three intrinsic uncertainty signals: self-consistency (agreement across multiple runs), reasoning length (as a proxy for deliberation depth), and verbalized confidence (model's expressed certainty in its output)
- Select the program with the lowest aggregate uncertainty signal and execute it to produce the final answer, replacing RLM's heuristic or fixed program selection with semantically informed self-reflection
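The three methodology steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `(answer, reasoning_trace, verbalized_confidence)` sample format, the length-normalization constant, and the equal weighting of the three signals are all assumptions.

```python
import statistics
from collections import Counter

def uncertainty_score(samples, max_reasoning_chars=4000):
    """Aggregate the three intrinsic uncertainty signals over repeated runs
    of one candidate program. `samples` is a list of
    (answer, reasoning_trace, verbalized_confidence) tuples; this tuple shape
    and the normalization constant are illustrative assumptions."""
    answers = [a for a, _, _ in samples]
    # 1) Self-consistency: share of runs agreeing with the majority answer.
    consistency = Counter(answers).most_common(1)[0][1] / len(samples)
    # 2) Reasoning length: longer traces taken as a proxy for more
    #    deliberation, hence more uncertainty (capped after normalization).
    avg_len = statistics.mean(len(trace) for _, trace, _ in samples)
    length_signal = min(avg_len / max_reasoning_chars, 1.0)
    # 3) Verbalized confidence: the model's own stated certainty in [0, 1].
    confidence = statistics.mean(c for _, _, c in samples)
    # Lower score = less uncertain; equal weights are an assumption.
    return (1.0 - consistency) + length_signal + (1.0 - confidence)

def select_program(candidates):
    """Pick the candidate program with the lowest aggregate uncertainty.
    `candidates` maps a program id to its list of sampled outputs."""
    return min(candidates, key=lambda p: uncertainty_score(candidates[p]))
```

In this sketch, a program whose repeated runs agree, reason briefly, and express high confidence scores near 0, while a disagreeing, long-winded, low-confidence program scores near 3 and is filtered out.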
System Components
- Self-consistency signal: measures agreement across multiple sampled outputs for a candidate program; higher consistency indicates lower uncertainty and a more reliable program choice
- Reasoning-length signal: uses the length of the model's reasoning trace as a proxy for uncertainty; longer or more tortured reasoning may indicate the model is less confident in a program's suitability
- Verbalized-confidence signal: extracts the model's explicitly stated confidence in its answer as a complementary indicator of internal uncertainty about the chosen program
- Uncertainty-aware program selector: the overarching framework that aggregates the three uncertainty signals to score and select the best context-interaction program among candidates, replacing heuristic selection in RLM
- Programmatic context decomposition: inherited from RLM, this component decomposes long-context tasks into structured sub-calls or programs that interact with portions of the context at inference time
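As one concrete illustration of the verbalized-confidence component above, the stated confidence can be parsed out of a model's free-text response. The elicitation conventions assumed here ("Confidence: 0.8" or "I am 80% confident") and the regexes are assumptions for the sketch, not the paper's actual prompt format.

```python
import re

def parse_verbalized_confidence(response_text, default=0.5):
    """Extract a verbalized confidence value in [0, 1] from a model response.
    Supports two assumed conventions: 'Confidence: 0.8' and 'I am 80%
    confident'. Falls back to `default` when no statement is found."""
    # Convention 1: an explicit 'Confidence: <float in [0, 1]>' field.
    m = re.search(r"confidence\s*[:=]\s*([01](?:\.\d+)?)", response_text, re.I)
    if m:
        return float(m.group(1))
    # Convention 2: a percentage phrase like '80% confident'.
    m = re.search(r"(\d{1,3})\s*%\s*confiden", response_text, re.I)
    if m:
        return min(int(m.group(1)), 100) / 100.0
    # No parseable statement: return a neutral prior.
    return default
```

A fallback default matters in practice, since a program whose outputs omit the confidence statement should be scored neutrally rather than crash the selector.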
Results
| Metric/Benchmark | RLM Baseline | SRLM (This Paper) | Delta |
|---|---|---|---|
| Long-context QA (best case) | State-of-the-art RLM | Up to 22% higher accuracy | +22% (max) |
| Semantically intensive tasks | Heuristic program search insufficient | Self-reflection provides semantic steering | Consistent gains |
| Short/in-window contexts | RLM degrades vs base model | SRLM yields consistent gains | Positive vs negative for RLM |
| SRLM w/o recursion vs full RLM | Full RLM with recursion | Matches or surpasses full RLM | Parity or better, with less complexity |
Key Takeaways
- Practitioners using RLM-style agentic long-context systems should replace or augment heuristic program selection with uncertainty-aware self-reflection—simple signals like self-consistency and verbalized confidence provide strong semantic guidance for program choice
- Recursion in RLM is not free: for contexts that fit within the model's native window, recursive decomposition can hurt performance, so SRLM's non-recursive variant is a safer and more broadly applicable default
- The three uncertainty signals (self-consistency, reasoning length, verbalized confidence) are intrinsic and require no external labels or additional models, making SRLM straightforward to integrate into existing LLM inference pipelines under a fixed time/compute budget
Abstract
Long-context handling remains a core challenge for language models: even with extended context windows, models often fail to reliably extract, reason over, and use information across long contexts. Recent work such as Recursive Language Models (RLM) addresses this challenge agentically, decomposing long contexts into recursive sub-calls through programmatic interaction at inference time. While promising, the success of RLM critically depends on how these context-interaction programs are selected, a question that has remained largely unexplored. In this paper, we study this problem and introduce SRLM, a framework that augments programmatic context interaction with uncertainty-aware self-reflection. SRLM leverages three intrinsic signals: self-consistency, reasoning length, and verbalized confidence. These serve as complementary indicators of a model's internal uncertainty, which the model uses to evaluate and compare candidate context-interaction programs. Extensive experiments across diverse benchmark datasets, context lengths, and backbone models show that SRLM consistently outperforms state-of-the-art baselines, yielding up to 22% improvement over RLM under the same time budget. Our findings show that recursion itself is not the primary driver of performance in RLM: a simple self-reflective program search can match or surpass RLM without requiring self-query or explicit recursion mechanisms. For context lengths within the model's window, RLM with recursion often degrades performance relative to the base model, whereas SRLM yields consistent gains across both short and long contexts. We also find that RLM is less effective on semantically intensive tasks, where heuristic program search is insufficient and broader contextual understanding is required; self-reflection in SRLM provides a semantic signal that better steers reasoning in these scenarios.