AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications
Problem Statement
Existing agent-memory benchmarks focus on dialogue-centric, human-agent interactions and therefore fail to capture real agentic deployments, where memory is a continuous stream of machine-generated agent-environment interactions. Because of this mismatch, current evaluations do not reflect real-world performance, and existing memory systems underperform in practice: they lack causal and objective information and rely on lossy similarity-based retrieval. Benchmarks and memory systems designed specifically for long-horizon, multi-step agentic tasks are therefore needed.
Key Novelty
- AMA-Bench: A new benchmark that pairs real-world agentic trajectories with expert-curated QA and synthetic trajectories (scalable to arbitrary horizons) with rule-based QA, designed for agent-environment interaction memory rather than dialogue memory
- Identification of key failure modes in existing memory systems: lack of causality/objective information and lossy similarity-based retrieval in long-horizon agentic contexts
- AMA-Agent: A novel memory system featuring a causality graph to capture structured relationships between actions/events and tool-augmented retrieval to overcome limitations of pure embedding-based search
Evaluation Highlights
- AMA-Agent achieves 57.22% average accuracy on AMA-Bench, surpassing the strongest memory system baseline by 11.16 percentage points
- Comprehensive study across representative agentic applications and arbitrary-horizon synthetic trajectories demonstrates consistent underperformance of existing memory systems on agent-centric memory tasks
Methodology
- Construct AMA-Bench by collecting real-world agentic trajectories across representative agentic applications (e.g., coding, web browsing, tool use) with expert-curated QA pairs, and generating synthetic trajectories scalable to arbitrary lengths with rule-based QA for controlled evaluation
- Diagnose failure modes of existing memory systems on AMA-Bench, identifying that similarity-based retrieval is lossy and that causal/objective information is absent from standard memory representations
- Design and evaluate AMA-Agent, which builds a causality graph over agent-environment interactions to capture action-outcome relationships, and employs tool-augmented retrieval (e.g., structured queries, programmatic lookups) to complement or replace pure embedding-based similarity search
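The second pillar of the benchmark, synthetic trajectories with rule-based QA, can be sketched in a few lines. This is a hypothetical illustration, not the paper's generator: the tool names, step format, and question rules are invented here. The key property it demonstrates is that answers are known by construction, so QA at any horizon needs no human annotation.

```python
import random

def generate_trajectory(num_steps, seed=0):
    """Generate a synthetic agent-environment trajectory of arbitrary length.

    Each step records a tool call and a machine-generated observation,
    mimicking the non-dialogue, machine-generated logs of real agent runs.
    """
    rng = random.Random(seed)
    tools = ["read_file", "run_tests", "search_web", "edit_file"]
    trajectory = []
    for step in range(num_steps):
        tool = rng.choice(tools)
        trajectory.append({
            "step": step,
            "action": tool,
            "observation": f"{tool} returned status={rng.choice(['ok', 'error'])}",
        })
    return trajectory

def rule_based_qa(trajectory):
    """Derive QA pairs by rules over the trajectory; ground truth is exact."""
    first_error = next(
        (s["step"] for s in trajectory if "error" in s["observation"]), None)
    counts = {}
    for s in trajectory:
        counts[s["action"]] = counts.get(s["action"], 0) + 1
    return [
        ("At which step did the first tool error occur?", first_error),
        ("Which tool was called most often?", max(counts, key=counts.get)),
    ]
```

Because `num_steps` is a free parameter, the same rules produce valid QA at any horizon, which is what enables stress-testing memory systems at extreme context lengths.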
System Components
- Real-world trajectory suite: Curated real-world agentic trajectories from representative applications (e.g., web agents, code agents) paired with expert-written QA that tests memory retrieval and reasoning over long interaction histories
- Synthetic trajectory suite: Programmatically generated agentic trajectories scalable to arbitrary horizons with rule-based QA, enabling stress-testing of memory systems at extreme context lengths
- Causality graph: A structured graph representation of agent-environment interactions that captures causal dependencies between actions and their outcomes, enabling more faithful and lossless memory encoding
- Tool-augmented retrieval: A retrieval mechanism that supplements or replaces similarity-based vector search with programmatic/structured tools (e.g., exact lookup, temporal ordering) to retrieve relevant memory entries more accurately
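A minimal sketch of the causality-graph component, under stated assumptions: the paper does not publish this data structure, so the class name, node payloads, and method names below are invented for illustration. The point it shows is that linking each action to the outcomes it caused makes "what did action X lead to?" an exact graph lookup rather than a lossy similarity search.

```python
from collections import defaultdict

class CausalityGraph:
    """Illustrative sketch (not the paper's implementation): a directed graph
    linking each agent action to the outcomes it caused."""

    def __init__(self):
        self.edges = defaultdict(list)   # action_id -> [outcome_id, ...]
        self.nodes = {}                  # node_id -> payload dict

    def add_action(self, action_id, payload):
        """Record an action node (e.g., a tool call)."""
        self.nodes[action_id] = {"type": "action", **payload}

    def add_outcome(self, action_id, outcome_id, payload):
        """Record an outcome node and a causal edge from its producing action."""
        self.nodes[outcome_id] = {"type": "outcome", **payload}
        self.edges[action_id].append(outcome_id)

    def outcomes_of(self, action_id):
        """Exact, lossless lookup of everything a given action caused."""
        return [self.nodes[o] for o in self.edges[action_id]]

# Hypothetical usage: an edit that caused a test failure.
graph = CausalityGraph()
graph.add_action("a1", {"tool": "edit_file", "target": "utils.py"})
graph.add_outcome("a1", "o1", {"event": "tests_failed"})
```

Here `graph.outcomes_of("a1")` returns the test-failure event directly, preserving the action-outcome structure that a flat embedding store would lose.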
Results
| Benchmark | Best Baseline | AMA-Agent | Delta (pp) |
|---|---|---|---|
| AMA-Bench Average Accuracy | ~46.06% | 57.22% | +11.16 |
Key Takeaways
- Dialogue-centric memory benchmarks are insufficient for evaluating agents in real deployments — practitioners should evaluate memory systems on agent-environment interaction traces, not just conversation histories
- Similarity-based (embedding/vector) retrieval is a significant bottleneck for long-horizon agent memory; incorporating structured retrieval tools and causal graph representations can yield substantial accuracy gains
- When building memory systems for long-horizon agents, explicitly modeling causality (which actions caused which outcomes) is critical, as purely content-based representations lose the temporal and causal structure needed for accurate retrospective reasoning
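The structured-retrieval idea in the takeaways can be made concrete with a small sketch. This is a hypothetical example (the entry schema and tool names are assumptions, not the paper's API): it contrasts exact and temporal-ordering tools, which return provably correct results, with embedding similarity, which can silently drop the right entry.

```python
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    step: int
    action: str
    observation: str

class ToolAugmentedRetriever:
    """Illustrative sketch: programmatic retrieval tools that complement
    or replace similarity-based vector search over agent memory."""

    def __init__(self, entries):
        # Keep entries in temporal order so ordering queries are trivial.
        self.entries = sorted(entries, key=lambda e: e.step)

    def exact_lookup(self, action):
        """Return every entry whose action matches exactly (no lossy ranking)."""
        return [e for e in self.entries if e.action == action]

    def last_before(self, step):
        """Temporal-ordering tool: most recent entry strictly before `step`."""
        candidates = [e for e in self.entries if e.step < step]
        return candidates[-1] if candidates else None
```

An agent can expose these as callable tools alongside an embedding index, routing questions like "what happened right before the failure at step k?" to `last_before` instead of nearest-neighbor search.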
Abstract
Large Language Models (LLMs) are deployed as autonomous agents in increasingly complex applications, where enabling long-horizon memory is critical for achieving strong performance. However, a significant gap exists between practical applications and current evaluation standards for agent memory: existing benchmarks primarily focus on dialogue-centric, human-agent interactions. In reality, agent memory consists of a continuous stream of agent-environment interactions that are primarily composed of machine-generated representations. To bridge this gap, we introduce AMA-Bench (Agent Memory with Any length), which evaluates long-horizon memory for LLMs in real agentic applications. It features two key components: (1) a set of real-world agentic trajectories across representative agentic applications, paired with expert-curated QA, and (2) a set of synthetic agentic trajectories that scale to arbitrary horizons, paired with rule-based QA. Our comprehensive study shows that existing memory systems underperform on AMA-Bench primarily because they lack causal and objective information and are constrained by the lossy nature of the similarity-based retrieval that many memory systems employ. To address these limitations, we propose AMA-Agent, an effective memory system featuring a causality graph and tool-augmented retrieval. Our results demonstrate that AMA-Agent achieves 57.22% average accuracy on AMA-Bench, surpassing the strongest memory system baseline by 11.16 percentage points.