Recursive Language Models Meet Uncertainty: The Surprising Effectiveness of Self-Reflective Program Search for Long Context
Problem Statement
Long-context handling is a persistent challenge for LLMs even when context windows are nominally large enough: models struggle to reliably extract and reason over extended inputs. Existing agentic approaches such as Recursive Language Models (RLM) decompose contexts via programmatic sub-calls but leave the critical question of program selection largely unaddressed. Poor program selection limits RLM's effectiveness, especially on semantically intensive tasks and on contexts that fit within the model's native window, where recursion can actually degrade performance.
Key Novelty
- Introduction of SRLM, an uncertainty-aware self-reflective program search framework that uses three complementary intrinsic uncertainty signals (self-consistency, reasoning length, verbalized confidence) to evaluate and select context-interaction programs
- Empirical finding that recursion is not the primary driver of RLM performance—simple self-reflective program search without recursion or self-query can match or surpass full RLM, challenging the core assumption of the RLM paradigm
- Demonstration that self-reflection provides a semantic signal that improves performance on semantically intensive tasks where heuristic program search fails, and yields consistent gains across both short and long contexts unlike RLM
Evaluation Highlights
- SRLM yields up to 22% improvement over RLM under the same time/compute budget across diverse benchmarks, context lengths, and backbone models
- SRLM without explicit recursion matches or surpasses full RLM, while RLM with recursion often degrades performance relative to the base model for contexts within the model's context window
Methodology
- Generate multiple candidate context-interaction programs for a given long-context task using the backbone LLM, analogous to the programmatic decomposition step in RLM
- Evaluate each candidate program using three intrinsic uncertainty signals: self-consistency (agreement across multiple runs), reasoning length (as a proxy for deliberation depth), and verbalized confidence (model's expressed certainty in its output)
- Select the program with the lowest aggregate uncertainty signal and execute it to produce the final answer, replacing RLM's heuristic or fixed program selection with semantically informed self-reflection
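The three methodology steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `(answer, reasoning_trace, verbalized_confidence)` sample format, the length-normalization constant, and the equal weighting of the three signals are all assumptions.

```python
import statistics
from collections import Counter

def uncertainty_score(samples, max_reasoning_chars=4000):
    """Aggregate the three intrinsic uncertainty signals over repeated runs
    of one candidate program. `samples` is a list of
    (answer, reasoning_trace, verbalized_confidence) tuples; this tuple shape
    and the normalization constant are illustrative assumptions."""
    answers = [a for a, _, _ in samples]
    # 1) Self-consistency: share of runs agreeing with the majority answer.
    consistency = Counter(answers).most_common(1)[0][1] / len(samples)
    # 2) Reasoning length: longer traces taken as a proxy for more
    #    deliberation, hence more uncertainty (capped after normalization).
    avg_len = statistics.mean(len(trace) for _, trace, _ in samples)
    length_signal = min(avg_len / max_reasoning_chars, 1.0)
    # 3) Verbalized confidence: the model's own stated certainty in [0, 1].
    confidence = statistics.mean(c for _, _, c in samples)
    # Lower score = less uncertain; equal weights are an assumption.
    return (1.0 - consistency) + length_signal + (1.0 - confidence)

def select_program(candidates):
    """Pick the candidate program with the lowest aggregate uncertainty.
    `candidates` maps a program id to its list of sampled outputs."""
    return min(candidates, key=lambda p: uncertainty_score(candidates[p]))
```

In this sketch, a program whose repeated runs agree, reason briefly, and express high confidence scores near 0, while a disagreeing, long-winded, low-confidence program scores near 3 and is filtered out.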
System Components
- Self-consistency signal: measures agreement across multiple sampled outputs for a candidate program; higher consistency indicates lower uncertainty and a more reliable program choice
- Reasoning-length signal: uses the length of the model's reasoning trace as a proxy for uncertainty; longer or more tortured reasoning may indicate the model is less confident in a program's suitability
- Verbalized-confidence signal: extracts the model's explicitly stated confidence in its answer as a complementary indicator of internal uncertainty about the chosen program
- Uncertainty-aware program selector: the overarching framework that aggregates the three uncertainty signals to score and select the best context-interaction program among candidates, replacing heuristic selection in RLM
- Programmatic context decomposition: inherited from RLM, this component decomposes long-context tasks into structured sub-calls or programs that interact with portions of the context at inference time
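As one concrete illustration of the verbalized-confidence component above, the stated confidence can be parsed out of a model's free-text response. The elicitation conventions assumed here ("Confidence: 0.8" or "I am 80% confident") and the regexes are assumptions for the sketch, not the paper's actual prompt format.

```python
import re

def parse_verbalized_confidence(response_text, default=0.5):
    """Extract a verbalized confidence value in [0, 1] from a model response.
    Supports two assumed conventions: 'Confidence: 0.8' and 'I am 80%
    confident'. Falls back to `default` when no statement is found."""
    # Convention 1: an explicit 'Confidence: <float in [0, 1]>' field.
    m = re.search(r"confidence\s*[:=]\s*([01](?:\.\d+)?)", response_text, re.I)
    if m:
        return float(m.group(1))
    # Convention 2: a percentage phrase like '80% confident'.
    m = re.search(r"(\d{1,3})\s*%\s*confiden", response_text, re.I)
    if m:
        return min(int(m.group(1)), 100) / 100.0
    # No parseable statement: return a neutral prior.
    return default
```

A fallback default matters in practice, since a program whose outputs omit the confidence statement should be scored neutrally rather than crash the selector.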
Results
| Metric/Benchmark | RLM Baseline | SRLM (This Paper) | Delta |
|---|---|---|---|
| Long-context QA (best case) | State-of-the-art RLM | Up to 22% higher accuracy | +22% (max) |
| Semantically intensive tasks | Heuristic program search insufficient | Self-reflection provides semantic steering | Consistent gains |
| Short/in-window contexts | RLM degrades vs base model | SRLM yields consistent gains | Positive vs negative for RLM |
| SRLM w/o recursion vs full RLM | Full RLM with recursion | Matches or surpasses full RLM | Parity or better, with less complexity |
Key Takeaways
- Practitioners using RLM-style agentic long-context systems should replace or augment heuristic program selection with uncertainty-aware self-reflection—simple signals like self-consistency and verbalized confidence provide strong semantic guidance for program choice
- Recursion in RLM is not free: for contexts that fit within the model's native window, recursive decomposition can hurt performance, so SRLM's non-recursive variant is a safer and more broadly applicable default
- The three uncertainty signals (self-consistency, reasoning length, verbalized confidence) are intrinsic and require no external labels or additional models, making SRLM straightforward to integrate into existing LLM inference pipelines under a fixed time/compute budget
Abstract
Long-context handling remains a core challenge for language models: even with extended context windows, models often fail to reliably extract, reason over, and use information across long contexts. Recent work such as Recursive Language Models (RLM) addresses this challenge agentically, decomposing long contexts into recursive sub-calls through programmatic interaction at inference time. While promising, the success of RLM critically depends on how these context-interaction programs are selected, a question that has remained largely unexplored. In this paper, we study this problem and introduce SRLM, a framework that augments programmatic context interaction with uncertainty-aware self-reflection. SRLM leverages three intrinsic signals: self-consistency, reasoning length, and verbalized confidence. These serve as complementary indicators of a model's internal uncertainty, which the model uses to evaluate and compare candidate context-interaction programs. Extensive experiments across diverse benchmark datasets, context lengths, and backbone models show that SRLM consistently outperforms state-of-the-art baselines, yielding up to 22% improvement over RLM under the same time budget. Our findings show that recursion itself is not the primary driver of performance in RLM: a simple self-reflective program search can match or surpass RLM without requiring self-query or explicit recursion mechanisms. For context lengths within the model's window, RLM with recursion often degrades performance relative to the base model, whereas SRLM yields consistent gains across both short and long contexts. We also find that RLM is less effective on semantically intensive tasks, where heuristic program search is insufficient and broader contextual understanding is required; self-reflection in SRLM provides a semantic signal that better steers reasoning in these scenarios.