Coding Agents are Effective Long-Context Processors
Problem Statement
LLMs suffer significant performance degradation as context length increases, and their internal attention-based context processing is latent and uninterpretable. Existing approaches, such as semantic search and context-window extension, have fundamental limitations when handling truly massive corpora (up to trillions of tokens). There is a need for an alternative paradigm that can reliably process long contexts without relying solely on the model's internal capacity.
Key Novelty
- Reframing long-context processing as an agentic task: externalizing context handling from latent attention into explicit, executable code and terminal commands via coding agents
- Demonstrating that off-the-shelf frontier coding agents, without task-specific fine-tuning, serve as a general-purpose interface for diverse long-context tasks including reasoning, RAG, and open-domain QA over 3-trillion-token corpora
- Identifying two key mechanisms behind agent efficacy: native tool proficiency (executable code/terminal commands over passive semantic queries) and file system familiarity (treating corpora as navigable directory structures)
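The "native tool proficiency" mechanism can be made concrete with a small sketch (the helper below is hypothetical and not from the paper): instead of issuing a fuzzy semantic query, an agent can run an exact, deterministic regex scan over corpus files, the same kind of lookup `grep` provides.

```python
import re
from pathlib import Path

def regex_search(corpus_dir: str, pattern: str, max_hits: int = 10):
    """Scan every .txt file under corpus_dir for a regex match, returning
    (file name, line number, line) tuples -- the deterministic, grep-like
    retrieval a coding agent gets from its native tooling."""
    rx = re.compile(pattern)
    hits = []
    for path in sorted(Path(corpus_dir).rglob("*.txt")):
        for lineno, line in enumerate(path.read_text().splitlines(), 1):
            if rx.search(line):
                hits.append((path.name, lineno, line))
                if len(hits) >= max_hits:
                    return hits
    return hits
```

Unlike embedding similarity, the result set here is exact and reproducible, which is the contrast the paper draws with passive semantic queries.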
Evaluation Highlights
- Coding agents outperform published state-of-the-art across multiple long-context benchmarks by 17.3% on average
- The approach scales to open-domain question answering over corpora containing up to three trillion tokens, far exceeding what any context window can accommodate
Methodology
- Frame long-context tasks (reasoning, RAG, open-domain QA) as agentic problems where the agent must organize and manipulate text stored in file systems using code and terminal tools
- Deploy off-the-shelf frontier coding agents (no task-specific fine-tuning) as the unified interface, allowing them to write and execute scripts to search, filter, index, and retrieve relevant information from large text corpora stored as directory structures
- Evaluate on multiple long-context benchmarks spanning diverse task types and corpus scales up to 3 trillion tokens, comparing against published state-of-the-art baselines including semantic search and extended context window methods
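The index-then-retrieve loop the methodology describes can be sketched as a toy inverted index (an illustration of the kind of script an agent might write and execute, not the paper's actual agent code; the file layout and function names are assumed):

```python
import collections
import re
from pathlib import Path

def build_index(corpus_dir: str) -> dict:
    """Build a toy inverted index mapping token -> set of file names,
    so later lookups over a large corpus avoid rescanning every file."""
    index = collections.defaultdict(set)
    for path in Path(corpus_dir).rglob("*.txt"):
        for token in re.findall(r"[a-z0-9]+", path.read_text().lower()):
            index[token].add(path.name)
    return index

def retrieve(index: dict, query: str) -> set:
    """Return the files containing every query token (simple AND semantics)."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    if not tokens:
        return set()
    result = index.get(tokens[0], set()).copy()
    for tok in tokens[1:]:
        result &= index.get(tok, set())
    return result
```

A real agent would generate, run, and iterate on scripts like this on the fly; the point is that retrieval becomes an explicit, inspectable program rather than a latent attention pass.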
System Components
Off-the-shelf frontier LLM-based coding agents that generate and execute code/terminal commands to process text, serving as a general-purpose long-context processor
Text corpora organized as directory and file structures, allowing agents to navigate massive datasets using familiar OS-level abstractions rather than semantic embeddings
Executable code (Python scripts, grep, awk, etc.) and terminal commands for precise, deterministic text retrieval and manipulation, replacing passive semantic-similarity search
Evaluation framework covering long-context reasoning, retrieval-augmented generation, and open-domain QA benchmarks with corpora ranging up to 3 trillion tokens
Results
| Benchmark Category | Prior SOTA | Coding Agents | Delta |
|---|---|---|---|
| Long-Context Reasoning | Prior SOTA | New SOTA | +17.3% avg across benchmarks |
| Retrieval-Augmented Generation | Semantic Search / Extended Window | Coding Agent | Part of +17.3% avg improvement |
| Open-Domain QA (up to 3T tokens) | Semantic Search Baselines | Coding Agent | Part of +17.3% avg improvement |
Key Takeaways
- Practitioners building long-context applications should consider coding agents with file-system-organized corpora as a practical alternative to vector databases or extended context windows—especially for very large-scale corpora where window scaling is infeasible
- Off-the-shelf coding agents (e.g., GPT-4-based code agents) can be repurposed for long-context NLP tasks without fine-tuning, lowering the barrier to building high-performance long-context systems
- The key design insight is to treat text corpora as file systems and retrieval/reasoning as programming tasks, enabling agents to use grep, indexing scripts, and other shell utilities that are more precise and scalable than semantic similarity search
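One way the "corpora as file systems" insight can stay practical at trillion-token scale is a hashed directory layout, so no single directory grows unmanageably large. The sketch below is purely illustrative; the paper does not prescribe a specific on-disk scheme, and `shard_path` is a hypothetical helper:

```python
import hashlib
from pathlib import Path

def shard_path(root: str, doc_id: str, fanout: int = 2) -> Path:
    """Map a document id to a nested path via its hash prefix, e.g.
    corpus/ab/cd/doc42.txt, so even a trillion-token corpus stays
    navigable with plain ls, grep, and glob patterns."""
    h = hashlib.sha1(doc_id.encode()).hexdigest()
    return Path(root, h[:fanout], h[fanout:2 * fanout], f"{doc_id}.txt")
```

Because the mapping is deterministic, both the agent's indexing scripts and its ad hoc shell commands can locate any document without a central database.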
Abstract
Large Language Models (LLMs) have demonstrated remarkable progress in scaling to access massive contexts. However, this access is mediated by latent and uninterpretable attention mechanisms, and LLMs fail to effectively process long contexts, exhibiting significant performance degradation as context length increases. In this work, we study whether long-context processing can be externalized from latent attention into explicit, executable interactions, by allowing coding agents to organize text in file systems and manipulate it using their native tools. We evaluate off-the-shelf frontier coding agents as a general interface for tasks that require processing long contexts, including long-context reasoning, retrieval-augmented generation, and open-domain question answering over large-scale corpora containing up to three trillion tokens. Across multiple benchmarks, these agents outperform the published state of the art by 17.3% on average. We attribute this efficacy to two key factors: native tool proficiency, which enables agents to leverage executable code and terminal commands rather than passive semantic queries, and file system familiarity, which allows them to navigate massive text corpora as directory structures. These findings suggest that delegating long-context processing to coding agents offers an effective alternative to semantic search or context window scaling, opening new directions for long-context processing in LLMs.