AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications
Problem Statement
Existing agent-memory benchmarks focus on dialogue-centric, human-agent interactions and therefore fail to capture real agentic deployments, where memory is a continuous stream of machine-generated agent-environment interactions. Because of this mismatch, current evaluations do not reflect real-world performance, and existing memory systems underperform in practice: they lack causal and objective information and rely on lossy similarity-based retrieval. Benchmarks and memory systems designed specifically for long-horizon, multi-step agentic tasks are therefore needed.
Key Novelty
- AMA-Bench: A new benchmark that pairs real-world agentic trajectories with expert-curated QA and synthetic trajectories (scalable to arbitrary horizons) with rule-based QA, designed for agent-environment interaction memory rather than dialogue memory
- Identification of key failure modes in existing memory systems: lack of causality/objective information and lossy similarity-based retrieval in long-horizon agentic contexts
- AMA-Agent: A novel memory system featuring a causality graph to capture structured relationships between actions/events and tool-augmented retrieval to overcome limitations of pure embedding-based search
Evaluation Highlights
- AMA-Agent achieves 57.22% average accuracy on AMA-Bench, surpassing the strongest memory system baseline by 11.16 percentage points
- Comprehensive study across representative agentic applications and arbitrary-horizon synthetic trajectories demonstrates consistent underperformance of existing memory systems on agent-centric memory tasks
Methodology
- Construct AMA-Bench by collecting real-world agentic trajectories across representative agentic applications (e.g., coding, web browsing, tool use) with expert-curated QA pairs, and generating synthetic trajectories scalable to arbitrary lengths with rule-based QA for controlled evaluation
- Diagnose failure modes of existing memory systems on AMA-Bench, identifying that similarity-based retrieval is lossy and that causal/objective information is absent from standard memory representations
- Design and evaluate AMA-Agent, which builds a causality graph over agent-environment interactions to capture action-outcome relationships, and employs tool-augmented retrieval (e.g., structured queries, programmatic lookups) to complement or replace pure embedding-based similarity search
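The second pillar of the benchmark, synthetic trajectories with rule-based QA, can be sketched in a few lines. This is a hypothetical illustration, not the paper's generator: the tool names, step format, and question rules are invented here. The key property it demonstrates is that answers are known by construction, so QA at any horizon needs no human annotation.

```python
import random

def generate_trajectory(num_steps, seed=0):
    """Generate a synthetic agent-environment trajectory of arbitrary length.

    Each step records a tool call and a machine-generated observation,
    mimicking the non-dialogue, machine-generated logs of real agent runs.
    """
    rng = random.Random(seed)
    tools = ["read_file", "run_tests", "search_web", "edit_file"]
    trajectory = []
    for step in range(num_steps):
        tool = rng.choice(tools)
        trajectory.append({
            "step": step,
            "action": tool,
            "observation": f"{tool} returned status={rng.choice(['ok', 'error'])}",
        })
    return trajectory

def rule_based_qa(trajectory):
    """Derive QA pairs by rules over the trajectory; ground truth is exact."""
    first_error = next(
        (s["step"] for s in trajectory if "error" in s["observation"]), None)
    counts = {}
    for s in trajectory:
        counts[s["action"]] = counts.get(s["action"], 0) + 1
    return [
        ("At which step did the first tool error occur?", first_error),
        ("Which tool was called most often?", max(counts, key=counts.get)),
    ]
```

Because `num_steps` is a free parameter, the same rules produce valid QA at any horizon, which is what enables stress-testing memory systems at extreme context lengths.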
System Components
- Real-world trajectory suite: Curated real-world agentic trajectories from representative applications (e.g., web agents, code agents) paired with expert-written QA that tests memory retrieval and reasoning over long interaction histories
- Synthetic trajectory suite: Programmatically generated agentic trajectories scalable to arbitrary horizons with rule-based QA, enabling stress-testing of memory systems at extreme context lengths
- Causality graph: A structured graph representation of agent-environment interactions that captures causal dependencies between actions and their outcomes, enabling more faithful and lossless memory encoding
- Tool-augmented retrieval: A retrieval mechanism that supplements or replaces similarity-based vector search with programmatic/structured tools (e.g., exact lookup, temporal ordering) to retrieve relevant memory entries more accurately
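A minimal sketch of the causality-graph component, under stated assumptions: the paper does not publish this data structure, so the class name, node payloads, and method names below are invented for illustration. The point it shows is that linking each action to the outcomes it caused makes "what did action X lead to?" an exact graph lookup rather than a lossy similarity search.

```python
from collections import defaultdict

class CausalityGraph:
    """Illustrative sketch (not the paper's implementation): a directed graph
    linking each agent action to the outcomes it caused."""

    def __init__(self):
        self.edges = defaultdict(list)   # action_id -> [outcome_id, ...]
        self.nodes = {}                  # node_id -> payload dict

    def add_action(self, action_id, payload):
        """Record an action node (e.g., a tool call)."""
        self.nodes[action_id] = {"type": "action", **payload}

    def add_outcome(self, action_id, outcome_id, payload):
        """Record an outcome node and a causal edge from its producing action."""
        self.nodes[outcome_id] = {"type": "outcome", **payload}
        self.edges[action_id].append(outcome_id)

    def outcomes_of(self, action_id):
        """Exact, lossless lookup of everything a given action caused."""
        return [self.nodes[o] for o in self.edges[action_id]]

# Hypothetical usage: an edit that caused a test failure.
graph = CausalityGraph()
graph.add_action("a1", {"tool": "edit_file", "target": "utils.py"})
graph.add_outcome("a1", "o1", {"event": "tests_failed"})
```

Here `graph.outcomes_of("a1")` returns the test-failure event directly, preserving the action-outcome structure that a flat embedding store would lose.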
Results
| Benchmark | Best Baseline | AMA-Agent | Delta (pp) |
|---|---|---|---|
| AMA-Bench Average Accuracy | ~46.06% | 57.22% | +11.16 |
Key Takeaways
- Dialogue-centric memory benchmarks are insufficient for evaluating agents in real deployments — practitioners should evaluate memory systems on agent-environment interaction traces, not just conversation histories
- Similarity-based (embedding/vector) retrieval is a significant bottleneck for long-horizon agent memory; incorporating structured retrieval tools and causal graph representations can yield substantial accuracy gains
- When building memory systems for long-horizon agents, explicitly modeling causality (which actions caused which outcomes) is critical, as purely content-based representations lose the temporal and causal structure needed for accurate retrospective reasoning
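The structured-retrieval idea in the takeaways can be made concrete with a small sketch. This is a hypothetical example (the entry schema and tool names are assumptions, not the paper's API): it contrasts exact and temporal-ordering tools, which return provably correct results, with embedding similarity, which can silently drop the right entry.

```python
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    step: int
    action: str
    observation: str

class ToolAugmentedRetriever:
    """Illustrative sketch: programmatic retrieval tools that complement
    or replace similarity-based vector search over agent memory."""

    def __init__(self, entries):
        # Keep entries in temporal order so ordering queries are trivial.
        self.entries = sorted(entries, key=lambda e: e.step)

    def exact_lookup(self, action):
        """Return every entry whose action matches exactly (no lossy ranking)."""
        return [e for e in self.entries if e.action == action]

    def last_before(self, step):
        """Temporal-ordering tool: most recent entry strictly before `step`."""
        candidates = [e for e in self.entries if e.step < step]
        return candidates[-1] if candidates else None
```

An agent can expose these as callable tools alongside an embedding index, routing questions like "what happened right before the failure at step k?" to `last_before` instead of nearest-neighbor search.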
Abstract
Large Language Models (LLMs) are deployed as autonomous agents in increasingly complex applications, where enabling long-horizon memory is critical for achieving strong performance. However, a significant gap exists between practical applications and current evaluation standards for agent memory: existing benchmarks primarily focus on dialogue-centric, human-agent interactions. In reality, agent memory consists of a continuous stream of agent-environment interactions that are primarily composed of machine-generated representations. To bridge this gap, we introduce AMA-Bench (Agent Memory with Any length), which evaluates long-horizon memory for LLMs in real agentic applications. It features two key components: (1) a set of real-world agentic trajectories across representative agentic applications, paired with expert-curated QA, and (2) a set of synthetic agentic trajectories that scale to arbitrary horizons, paired with rule-based QA. Our comprehensive study shows that existing memory systems underperform on AMA-Bench primarily because they lack causal and objective information and are constrained by the lossy nature of the similarity-based retrieval that many memory systems employ. To address these limitations, we propose AMA-Agent, an effective memory system featuring a causality graph and tool-augmented retrieval. Our results demonstrate that AMA-Agent achieves 57.22% average accuracy on AMA-Bench, surpassing the strongest memory system baseline by 11.16 percentage points.