Recursive Language Models
Problem Statement
Current LLMs are bounded by fixed context windows, making them unable to handle documents, codebases, or datasets that exceed those limits without lossy truncation or expensive retrieval heuristics. Existing long-context scaffolds (e.g., RAG, sliding windows) often degrade quality or fail entirely on tasks requiring global reasoning across very long inputs. There is no principled, general framework that allows a model to adaptively decompose and reason over inputs orders of magnitude larger than its context window.
Key Novelty
- A general inference paradigm (RLMs) that allows an LLM to programmatically decompose long prompts and recursively invoke itself over sub-segments, treating the full prompt as an external environment
- Post-training of the first natively recursive language model (RLM-Qwen3-8B), demonstrating that recursive behavior can be learned and internalized rather than only scaffolded at inference time
- Demonstration that RLMs can process inputs up to two orders of magnitude beyond the model's native context window while outperforming frontier LLMs and common long-context baselines on four diverse tasks at comparable cost
Evaluation Highlights
- RLM-Qwen3-8B outperforms the base Qwen3-8B model by 28.3% on average across long-context benchmarks
- RLM-Qwen3-8B approaches the quality of vanilla GPT-5 on three long-context tasks, despite being a much smaller open model
Breakthrough Assessment
Methodology
- Frame the long-context problem as inference-time scaling: rather than extending the context window, allow the LLM to act as an agent over its own prompt, programmatically examining and decomposing the input into context-sized snippets
- Implement the RLM inference paradigm, in which the model issues recursive self-calls on sub-segments of the prompt and accumulates and synthesizes the intermediate results to answer queries that require global reasoning (a minimal sketch follows this list)
- Post-train a base model (Qwen3-8B) on recursive decomposition tasks to produce RLM-Qwen3-8B, a natively recursive model that has internalized the decomposition strategy rather than relying solely on prompt engineering or external scaffolding
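The core loop can be pictured as a map-and-synthesize recursion over the prompt. The sketch below is a minimal, hypothetical illustration only: it uses fixed character-based chunking, a placeholder `call_model`, and hand-written prompts, whereas in the RLM paradigm the model itself decides programmatically how to examine and decompose its input.

```python
MAX_CHARS = 32_000  # assumed budget that comfortably fits the model's context window


def call_model(prompt: str) -> str:
    """Single bounded-context LLM call; swap in any chat-completion API."""
    raise NotImplementedError


def rlm_answer(query: str, document: str) -> str:
    """Answer `query` over `document`, recursing whenever the input is too long."""
    if len(document) <= MAX_CHARS:
        # Base case: the snippet fits in a single call.
        return call_model(f"Context:\n{document}\n\nQuestion: {query}")

    # Recursive case: split the input into window-sized snippets, answer the
    # query over each, then synthesize the partial answers (recursing again if
    # even the collected notes are too long for one call).
    chunks = [document[i:i + MAX_CHARS] for i in range(0, len(document), MAX_CHARS)]
    partials = [rlm_answer(query, chunk) for chunk in chunks]
    notes = "\n".join(f"- {p}" for p in partials)
    return rlm_answer(query, f"Notes extracted from document segments:\n{notes}")
```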
System Components
- Prompt-as-environment framework: a general mechanism by which the LLM treats the long prompt as an external environment, issuing programmatic calls to examine sub-segments and recursively invoking itself to process each chunk (see the sketch after this list)
- Decomposition strategy: the mechanism by which the model decides how to split a long input into manageable snippets and determines the order and structure of recursive sub-calls
- RLM-Qwen3-8B: a post-trained 8B-parameter model that has internalized recursive long-context reasoning, trained to natively apply the RLM paradigm rather than relying on external scaffolding
- Aggregation and synthesis: the process by which the model combines the outputs of recursive sub-calls into a coherent final answer, enabling global reasoning over very long inputs
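To make the first component concrete, the sketch below shows one hypothetical shape of a prompt-as-environment loop: the long input lives in a variable that the model inspects through generated code, and `recurse` stands in for a recursive self-call on a snippet. All names, prompts, and the step budget are illustrative assumptions rather than the released implementation's API, and executing model-written code without a sandbox is for illustration only.

```python
def call_model(prompt: str) -> str:
    """Single bounded-context LLM call; swap in any chat-completion API."""
    raise NotImplementedError


def run_prompt_environment(query: str, long_prompt: str, max_steps: int = 8) -> str:
    """Let the model explore `long_prompt` through an executable scratch namespace."""
    namespace = {
        "PROMPT": long_prompt,  # the full input lives in the environment, not the context
        "recurse": lambda q, snippet: call_model(f"{snippet}\n\nQuestion: {q}"),
        "ANSWER": None,
    }
    transcript = (
        f"A variable PROMPT holds {len(long_prompt)} characters of input.\n"
        f"Question: {query}"
    )
    for _ in range(max_steps):
        # Ask the model for the next piece of code to run against the environment.
        code = call_model(
            f"{transcript}\n\nWrite Python that inspects PROMPT, may call "
            "recurse(question, snippet) on sub-segments, and sets ANSWER when done."
        )
        exec(code, namespace)  # run the model-written exploration step (unsandboxed!)
        if namespace["ANSWER"] is not None:
            return namespace["ANSWER"]
        transcript += f"\n[executed]\n{code}"
    return "No answer produced within the step budget."
```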
Results
| Metric/Benchmark | Baseline | This Paper (RLM-Qwen3-8B) | Delta |
|---|---|---|---|
| Avg. long-context tasks (4 tasks) | Qwen3-8B (vanilla) | RLM-Qwen3-8B | +28.3% |
| Long-context quality (3 tasks) | Vanilla GPT-5 | Approaches GPT-5 quality | Competitive |
| Maximum input length handled | Native context window | ~100x the native context window | +2 orders of magnitude |
| Inference cost | Common long-context scaffolds | Comparable cost | No regression |
Key Takeaways
- Practitioners facing long-document tasks can use the RLM inference paradigm as a drop-in scaffold with existing frontier LLMs, achieving better quality than RAG or sliding-window approaches at comparable cost without retraining
- Post-training smaller open models (8B scale) on recursive decomposition objectives is a viable and cost-effective path to long-context capability that can rival much larger proprietary models on targeted tasks
- Framing long-context processing as inference-time scaling (recursive self-calls) rather than as a context-window engineering problem opens a new design space: the quality of long-context reasoning can be improved by allocating more inference compute to decomposition depth and breadth (see the sketch below)
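As a rough illustration of that design space, decomposition depth and breadth can be exposed as explicit inference-compute knobs. The sketch below is a hypothetical parameterization (uniform splitting, placeholder `call_model`), not the paper's method; raising `depth` or `breadth` simply trades more sub-calls for finer-grained coverage of the input.

```python
def call_model(prompt: str) -> str:
    """Single bounded-context LLM call; swap in any chat-completion API."""
    raise NotImplementedError


def rlm_with_budget(query: str, document: str, depth: int = 2, breadth: int = 8) -> str:
    """Spend more inference compute by recursing deeper (`depth`) or splitting
    each level into more, smaller snippets (`breadth`)."""
    if depth == 0 or len(document) <= 4_000:
        # Out of recursion budget, or the input already fits: answer directly.
        return call_model(f"Context:\n{document}\n\nQuestion: {query}")

    step = max(1, len(document) // breadth)
    chunks = [document[i:i + step] for i in range(0, len(document), step)]
    partials = [rlm_with_budget(query, chunk, depth - 1, breadth) for chunk in chunks]
    summary = "\n".join(f"- {p}" for p in partials)
    return call_model(f"Partial answers from segments:\n{summary}\n\nQuestion: {query}")
```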
Abstract
We study allowing large language models (LLMs) to process arbitrarily long prompts through the lens of inference-time scaling. We propose Recursive Language Models (RLMs), a general inference paradigm that treats long prompts as part of an external environment and allows the LLM to programmatically examine, decompose, and recursively call itself over snippets of the prompt. We find that RLMs can successfully process inputs up to two orders of magnitude beyond model context windows and, even for shorter prompts, dramatically outperform the quality of vanilla frontier LLMs and common long-context scaffolds across four diverse long-context tasks while having comparable cost. At a small scale, we post-train the first natively recursive language model. Our model, RLM-Qwen3-8B, outperforms the underlying Qwen3-8B model by $28.3\%$ on average and even approaches the quality of vanilla GPT-5 on three long-context tasks. Code is available at https://github.com/alexzhang13/rlm.