Recursive Language Models
Problem Statement
Current LLMs are bounded by fixed context windows, making them unable to handle documents, codebases, or datasets that exceed those limits without lossy truncation or expensive retrieval heuristics. Existing long-context scaffolds (e.g., RAG, sliding windows) often degrade quality or fail entirely on tasks requiring global reasoning across very long inputs. There is no principled, general framework that allows a model to adaptively decompose and reason over inputs orders of magnitude larger than its context window.
Key Novelty
- A general inference paradigm (RLMs) that allows an LLM to programmatically decompose long prompts and recursively invoke itself over sub-segments, treating the full prompt as an external environment
- Post-training of the first natively recursive language model (RLM-Qwen3-8B), demonstrating that recursive behavior can be learned and internalized rather than only scaffolded at inference time
- Demonstration that RLMs can process inputs up to two orders of magnitude beyond the model's native context window while outperforming frontier LLMs and common long-context baselines on four diverse tasks at comparable cost
Evaluation Highlights
- RLM-Qwen3-8B outperforms the base Qwen3-8B model by 28.3% on average across long-context benchmarks
- RLM-Qwen3-8B approaches the quality of vanilla GPT-5 on three long-context tasks, despite being a much smaller open model
Breakthrough Assessment
Methodology
- Frame the long-context problem as inference-time scaling: rather than extending the context window, allow the LLM to act as an agent over its own prompt, programmatically examining and decomposing the input into context-sized snippets
- Implement the RLM inference paradigm, in which the model issues recursive self-calls on sub-segments of the prompt and accumulates and synthesizes the intermediate results to answer queries that require global reasoning (a minimal sketch follows this list)
- Post-train a base model (Qwen3-8B) on recursive decomposition tasks to produce RLM-Qwen3-8B, a natively recursive model that has internalized the decomposition strategy rather than relying solely on prompt engineering or external scaffolding
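The core loop can be pictured as a map-and-synthesize recursion over the prompt. The sketch below is a minimal, hypothetical illustration only: it uses fixed character-based chunking, a placeholder `call_model`, and hand-written prompts, whereas in the RLM paradigm the model itself decides programmatically how to examine and decompose its input.

```python
MAX_CHARS = 32_000  # assumed budget that comfortably fits the model's context window


def call_model(prompt: str) -> str:
    """Single bounded-context LLM call; swap in any chat-completion API."""
    raise NotImplementedError


def rlm_answer(query: str, document: str) -> str:
    """Answer `query` over `document`, recursing whenever the input is too long."""
    if len(document) <= MAX_CHARS:
        # Base case: the snippet fits in a single call.
        return call_model(f"Context:\n{document}\n\nQuestion: {query}")

    # Recursive case: split the input into window-sized snippets, answer the
    # query over each, then synthesize the partial answers (recursing again if
    # even the collected notes are too long for one call).
    chunks = [document[i:i + MAX_CHARS] for i in range(0, len(document), MAX_CHARS)]
    partials = [rlm_answer(query, chunk) for chunk in chunks]
    notes = "\n".join(f"- {p}" for p in partials)
    return rlm_answer(query, f"Notes extracted from document segments:\n{notes}")
```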
System Components
- Prompt-as-environment framework: a general mechanism by which the LLM treats the long prompt as an external environment, issuing programmatic calls to examine sub-segments and recursively invoking itself to process each chunk (see the sketch after this list)
- Decomposition strategy: the mechanism by which the model decides how to split a long input into manageable snippets and determines the order and structure of recursive sub-calls
- RLM-Qwen3-8B: a post-trained 8B-parameter model that has internalized recursive long-context reasoning, trained to natively apply the RLM paradigm rather than relying on external scaffolding
- Aggregation and synthesis: the process by which the model combines the outputs of recursive sub-calls into a coherent final answer, enabling global reasoning over very long inputs
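To make the first component concrete, the sketch below shows one hypothetical shape of a prompt-as-environment loop: the long input lives in a variable that the model inspects through generated code, and `recurse` stands in for a recursive self-call on a snippet. All names, prompts, and the step budget are illustrative assumptions rather than the released implementation's API, and executing model-written code without a sandbox is for illustration only.

```python
def call_model(prompt: str) -> str:
    """Single bounded-context LLM call; swap in any chat-completion API."""
    raise NotImplementedError


def run_prompt_environment(query: str, long_prompt: str, max_steps: int = 8) -> str:
    """Let the model explore `long_prompt` through an executable scratch namespace."""
    namespace = {
        "PROMPT": long_prompt,  # the full input lives in the environment, not the context
        "recurse": lambda q, snippet: call_model(f"{snippet}\n\nQuestion: {q}"),
        "ANSWER": None,
    }
    transcript = (
        f"A variable PROMPT holds {len(long_prompt)} characters of input.\n"
        f"Question: {query}"
    )
    for _ in range(max_steps):
        # Ask the model for the next piece of code to run against the environment.
        code = call_model(
            f"{transcript}\n\nWrite Python that inspects PROMPT, may call "
            "recurse(question, snippet) on sub-segments, and sets ANSWER when done."
        )
        exec(code, namespace)  # run the model-written exploration step (unsandboxed!)
        if namespace["ANSWER"] is not None:
            return namespace["ANSWER"]
        transcript += f"\n[executed]\n{code}"
    return "No answer produced within the step budget."
```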
Results
| Metric/Benchmark | Baseline | This Paper (RLM-Qwen3-8B) | Delta |
|---|---|---|---|
| Avg. long-context tasks (4 tasks) | Qwen3-8B (vanilla) | RLM-Qwen3-8B | +28.3% |
| Long-context quality (3 tasks) | Vanilla GPT-5 | Approaches GPT-5 quality | Competitive |
| Maximum input length handled | Native context window | ~100x the native context window | +2 orders of magnitude |
| Inference cost | Common long-context scaffolds | Comparable cost | No regression |
Key Takeaways
- Practitioners facing long-document tasks can use the RLM inference paradigm as a drop-in scaffold with existing frontier LLMs, achieving better quality than RAG or sliding-window approaches at comparable cost without retraining
- Post-training smaller open models (8B scale) on recursive decomposition objectives is a viable and cost-effective path to long-context capability that can rival much larger proprietary models on targeted tasks
- Framing long-context processing as inference-time scaling (recursive self-calls) rather than as a context-window engineering problem opens a new design space: the quality of long-context reasoning can be improved by allocating more inference compute to decomposition depth and breadth (see the sketch below)
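As a rough illustration of that design space, decomposition depth and breadth can be exposed as explicit inference-compute knobs. The sketch below is a hypothetical parameterization (uniform splitting, placeholder `call_model`), not the paper's method; raising `depth` or `breadth` simply trades more sub-calls for finer-grained coverage of the input.

```python
def call_model(prompt: str) -> str:
    """Single bounded-context LLM call; swap in any chat-completion API."""
    raise NotImplementedError


def rlm_with_budget(query: str, document: str, depth: int = 2, breadth: int = 8) -> str:
    """Spend more inference compute by recursing deeper (`depth`) or splitting
    each level into more, smaller snippets (`breadth`)."""
    if depth == 0 or len(document) <= 4_000:
        # Out of recursion budget, or the input already fits: answer directly.
        return call_model(f"Context:\n{document}\n\nQuestion: {query}")

    step = max(1, len(document) // breadth)
    chunks = [document[i:i + step] for i in range(0, len(document), step)]
    partials = [rlm_with_budget(query, chunk, depth - 1, breadth) for chunk in chunks]
    summary = "\n".join(f"- {p}" for p in partials)
    return call_model(f"Partial answers from segments:\n{summary}\n\nQuestion: {query}")
```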
Abstract
We study allowing large language models (LLMs) to process arbitrarily long prompts through the lens of inference-time scaling. We propose Recursive Language Models (RLMs), a general inference paradigm that treats long prompts as part of an external environment and allows the LLM to programmatically examine, decompose, and recursively call itself over snippets of the prompt. We find that RLMs can successfully process inputs up to two orders of magnitude beyond model context windows and, even for shorter prompts, dramatically outperform the quality of vanilla frontier LLMs and common long-context scaffolds across four diverse long-context tasks while having comparable cost. At a small scale, we post-train the first natively recursive language model. Our model, RLM-Qwen3-8B, outperforms the underlying Qwen3-8B model by $28.3\%$ on average and even approaches the quality of vanilla GPT-5 on three long-context tasks. Code is available at https://github.com/alexzhang13/rlm.