Coding Agents are Effective Long-Context Processors
Problem Statement
LLMs suffer significant performance degradation as context length increases, and their internal attention-based context processing is latent and uninterpretable. Existing approaches, such as semantic search and context-window extension, have fundamental limitations when handling truly massive corpora (up to trillions of tokens). There is a need for an alternative paradigm that can reliably process long contexts without relying solely on the model's internal capacity.
Key Novelty
- Reframing long-context processing as an agentic task: externalizing context handling from latent attention into explicit, executable code and terminal commands via coding agents
- Demonstrating that off-the-shelf frontier coding agents, without task-specific fine-tuning, serve as a general-purpose interface for diverse long-context tasks including reasoning, RAG, and open-domain QA over 3-trillion-token corpora
- Identifying two key mechanisms behind agent efficacy: native tool proficiency (executable code/terminal commands over passive semantic queries) and file system familiarity (treating corpora as navigable directory structures)
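The "native tool proficiency" mechanism can be made concrete with a small sketch (the helper below is hypothetical and not from the paper): instead of issuing a fuzzy semantic query, an agent can run an exact, deterministic regex scan over corpus files, the same kind of lookup `grep` provides.

```python
import re
from pathlib import Path

def regex_search(corpus_dir: str, pattern: str, max_hits: int = 10):
    """Scan every .txt file under corpus_dir for a regex match, returning
    (file name, line number, line) tuples -- the deterministic, grep-like
    retrieval a coding agent gets from its native tooling."""
    rx = re.compile(pattern)
    hits = []
    for path in sorted(Path(corpus_dir).rglob("*.txt")):
        for lineno, line in enumerate(path.read_text().splitlines(), 1):
            if rx.search(line):
                hits.append((path.name, lineno, line))
                if len(hits) >= max_hits:
                    return hits
    return hits
```

Unlike embedding similarity, the result set here is exact and reproducible, which is the contrast the paper draws with passive semantic queries.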
Evaluation Highlights
- Coding agents outperform published state-of-the-art across multiple long-context benchmarks by 17.3% on average
- The approach scales to open-domain question answering over corpora containing up to three trillion tokens, far exceeding what any context window can accommodate
Methodology
- Frame long-context tasks (reasoning, RAG, open-domain QA) as agentic problems where the agent must organize and manipulate text stored in file systems using code and terminal tools
- Deploy off-the-shelf frontier coding agents (no task-specific fine-tuning) as the unified interface, allowing them to write and execute scripts to search, filter, index, and retrieve relevant information from large text corpora stored as directory structures
- Evaluate on multiple long-context benchmarks spanning diverse task types and corpus scales up to 3 trillion tokens, comparing against published state-of-the-art baselines including semantic search and extended context window methods
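The index-then-retrieve loop the methodology describes can be sketched as a toy inverted index (an illustration of the kind of script an agent might write and execute, not the paper's actual agent code; the file layout and function names are assumed):

```python
import collections
import re
from pathlib import Path

def build_index(corpus_dir: str) -> dict:
    """Build a toy inverted index mapping token -> set of file names,
    so later lookups over a large corpus avoid rescanning every file."""
    index = collections.defaultdict(set)
    for path in Path(corpus_dir).rglob("*.txt"):
        for token in re.findall(r"[a-z0-9]+", path.read_text().lower()):
            index[token].add(path.name)
    return index

def retrieve(index: dict, query: str) -> set:
    """Return the files containing every query token (simple AND semantics)."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    if not tokens:
        return set()
    result = index.get(tokens[0], set()).copy()
    for tok in tokens[1:]:
        result &= index.get(tok, set())
    return result
```

A real agent would generate, run, and iterate on scripts like this on the fly; the point is that retrieval becomes an explicit, inspectable program rather than a latent attention pass.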
System Components
Off-the-shelf frontier LLM-based coding agents that generate and execute code/terminal commands to process text, serving as a general-purpose long-context processor
Text corpora organized as directory and file structures, allowing agents to navigate massive datasets using familiar OS-level abstractions rather than semantic embeddings
Executable code (Python scripts, grep, awk, etc.) and terminal commands for precise, deterministic text retrieval and manipulation, replacing passive semantic-similarity search
Evaluation framework covering long-context reasoning, retrieval-augmented generation, and open-domain QA benchmarks with corpora ranging up to 3 trillion tokens
Results
| Benchmark Category | Prior SOTA | Coding Agents | Delta |
|---|---|---|---|
| Long-Context Reasoning | Prior SOTA | New SOTA | +17.3% avg across benchmarks |
| Retrieval-Augmented Generation | Semantic Search / Extended Window | Coding Agent | Part of +17.3% avg improvement |
| Open-Domain QA (up to 3T tokens) | Semantic Search Baselines | Coding Agent | Part of +17.3% avg improvement |
Key Takeaways
- Practitioners building long-context applications should consider coding agents with file-system-organized corpora as a practical alternative to vector databases or extended context windows—especially for very large-scale corpora where window scaling is infeasible
- Off-the-shelf coding agents (e.g., GPT-4-based code agents) can be repurposed for long-context NLP tasks without fine-tuning, lowering the barrier to building high-performance long-context systems
- The key design insight is to treat text corpora as file systems and retrieval/reasoning as programming tasks, enabling agents to use grep, indexing scripts, and other shell utilities that are more precise and scalable than semantic similarity search
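One way the "corpora as file systems" insight can stay practical at trillion-token scale is a hashed directory layout, so no single directory grows unmanageably large. The sketch below is purely illustrative; the paper does not prescribe a specific on-disk scheme, and `shard_path` is a hypothetical helper:

```python
import hashlib
from pathlib import Path

def shard_path(root: str, doc_id: str, fanout: int = 2) -> Path:
    """Map a document id to a nested path via its hash prefix, e.g.
    corpus/ab/cd/doc42.txt, so even a trillion-token corpus stays
    navigable with plain ls, grep, and glob patterns."""
    h = hashlib.sha1(doc_id.encode()).hexdigest()
    return Path(root, h[:fanout], h[fanout:2 * fanout], f"{doc_id}.txt")
```

Because the mapping is deterministic, both the agent's indexing scripts and its ad hoc shell commands can locate any document without a central database.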
Abstract
Large Language Models (LLMs) have demonstrated remarkable progress in scaling to access massive contexts. However, this access is mediated by latent and uninterpretable attention mechanisms, and LLMs fail to effectively process long contexts, exhibiting significant performance degradation as context length increases. In this work, we study whether long-context processing can be externalized from latent attention into explicit, executable interactions, by allowing coding agents to organize text in file systems and manipulate it using their native tools. We evaluate off-the-shelf frontier coding agents as a general interface for tasks that require processing long contexts, including long-context reasoning, retrieval-augmented generation, and open-domain question answering over large-scale corpora containing up to three trillion tokens. Across multiple benchmarks, these agents outperform the published state of the art by 17.3% on average. We attribute this efficacy to two key factors: native tool proficiency, which enables agents to leverage executable code and terminal commands rather than passive semantic queries, and file system familiarity, which allows them to navigate massive text corpora as directory structures. These findings suggest that delegating long-context processing to coding agents offers an effective alternative to semantic search or context window scaling, opening new directions for long-context processing in LLMs.