
AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications

Yujie Zhao, Bo Yuan, Junbo Huang, Haochen Yuan, Zhongming Yu, Haozhou Xu, Lanxiang Hu, Abhilash Shankarampeta, Zimeng Huang, Wentao Ni, Yuandong Tian, Jishen Zhao
2026
AMA-Bench is a new benchmark for evaluating long-horizon memory in LLM-based autonomous agents, featuring real-world and synthetic agentic trajectories composed of machine-generated agent-environment interactions rather than dialogue-centric exchanges. The authors also propose AMA-Agent, a memory system with a causality graph and tool-augmented retrieval that significantly outperforms existing baselines.

Problem Statement

Existing agent memory benchmarks focus on dialogue-centric human-agent interactions, failing to capture the reality of agentic deployments where memory consists of continuous streams of machine-generated agent-environment interactions. This mismatch means current evaluations do not reflect real-world performance, and existing memory systems underperform in practice due to lack of causal/objective information and lossy similarity-based retrieval. There is a critical need for benchmarks and memory systems designed specifically for long-horizon, multi-step agentic tasks.

Key Novelty

  • AMA-Bench: A new benchmark with real-world agentic trajectories paired with expert-curated QA and synthetic trajectories scalable to arbitrary horizons with rule-based QA, specifically designed for agent-environment interaction memory rather than dialogue memory
  • Identification of key failure modes in existing memory systems: lack of causality/objective information and lossy similarity-based retrieval in long-horizon agentic contexts
  • AMA-Agent: A novel memory system featuring a causality graph to capture structured relationships between actions/events and tool-augmented retrieval to overcome limitations of pure embedding-based search

Evaluation Highlights

  • AMA-Agent achieves 57.22% average accuracy on AMA-Bench, surpassing the strongest memory system baseline by 11.16 percentage points
  • Comprehensive study across representative agentic applications and arbitrary-horizon synthetic trajectories demonstrates consistent underperformance of existing memory systems on agent-centric memory tasks

Breakthrough Assessment

6/10. AMA-Bench fills a genuine and important gap in agentic AI evaluation infrastructure, and the causality graph + tool-augmented retrieval approach is a meaningful architectural advance. However, the absolute performance (57.22%) leaves substantial room for improvement, and the core ideas (graph-based memory, tool-augmented retrieval) build incrementally on existing concepts rather than representing a paradigm shift.

Methodology

  1. Construct AMA-Bench by collecting real-world agentic trajectories across representative agentic applications (e.g., coding, web browsing, tool use) with expert-curated QA pairs, and generating synthetic trajectories scalable to arbitrary lengths with rule-based QA for controlled evaluation
  2. Diagnose failure modes of existing memory systems on AMA-Bench, identifying that similarity-based retrieval is lossy and that causal/objective information is absent from standard memory representations
  3. Design and evaluate AMA-Agent, which builds a causality graph over agent-environment interactions to capture action-outcome relationships, and employs tool-augmented retrieval (e.g., structured queries, programmatic lookups) to complement or replace pure embedding-based similarity search

System Components

AMA-Bench (Real-World Split)

Curated real-world agentic trajectories from representative applications (e.g., web agents, code agents) paired with expert-written QA that tests memory retrieval and reasoning over long interaction histories

AMA-Bench (Synthetic Split)

Programmatically generated agentic trajectories scalable to arbitrary horizons with rule-based QA, enabling stress-testing of memory systems at extreme context lengths
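To make the idea of arbitrary-horizon trajectories with rule-based QA concrete, here is a minimal sketch of how such data could be generated. The action vocabulary, function names, and QA template are illustrative assumptions, not AMA-Bench's actual schema; the point is that the answer is derived mechanically from the trajectory, so it can be checked without human annotation.

```python
import random


def make_synthetic_trajectory(num_steps, seed=0):
    """Generate a toy agent-environment trajectory of arbitrary horizon.

    Each entry is (step, action, result). The action/result vocabulary
    is illustrative, not the benchmark's real interaction format.
    """
    rng = random.Random(seed)
    actions = ["read_file", "run_test", "search_web", "edit_file"]
    return [(step, rng.choice(actions), f"result_{step}")
            for step in range(num_steps)]


def make_rule_based_qa(traj):
    """Derive a QA pair whose answer follows mechanically from the trajectory.

    Because the answer is computed from the data itself, grading is exact
    and the horizon can be scaled to any length.
    """
    step, action, result = traj[len(traj) // 2]
    question = f"What result did the agent observe for '{action}' at step {step}?"
    return question, result
```

Scaling `num_steps` up stresses the memory system at longer and longer contexts while keeping grading fully automatic.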

Causality Graph

A structured graph representation of agent-environment interactions that captures causal dependencies between actions and their outcomes, enabling more faithful and lossless memory encoding
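As a rough illustration of the data structure, the sketch below encodes interactions as nodes and causal "caused-by" edges, so an outcome can be traced back to the actions behind it. Class and field names are hypothetical; the paper's actual graph construction is not specified here.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """One agent-environment interaction (an action or an observed outcome)."""
    step: int
    kind: str      # e.g. "action" or "outcome"
    content: str


@dataclass
class CausalityGraph:
    """Directed graph linking actions to the outcomes they caused."""
    events: dict = field(default_factory=dict)      # step -> Event
    caused_by: dict = field(default_factory=dict)   # step -> parent steps

    def add_event(self, event, parents=()):
        self.events[event.step] = event
        self.caused_by[event.step] = list(parents)

    def trace_causes(self, step):
        """Walk backwards through causal edges to recover the chain of
        events behind an outcome, in chronological order."""
        chain, frontier, seen = [], [step], set()
        while frontier:
            s = frontier.pop()
            if s in seen:
                continue
            seen.add(s)
            chain.append(self.events[s])
            frontier.extend(self.caused_by.get(s, []))
        return sorted(chain, key=lambda e: e.step)
```

Compared with a flat log, this makes retrospective questions ("which action caused this failure?") answerable by graph traversal rather than by hoping a similarity search surfaces the right entries.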

Tool-Augmented Retrieval

A retrieval mechanism that supplements or replaces similarity-based vector search with programmatic/structured tools (e.g., exact lookup, temporal ordering) to retrieve relevant memory entries more accurately
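A minimal sketch of what such structured retrieval tools could look like, assuming memory entries are timestamped text records; the class and method names are invented for illustration. Unlike approximate vector search, each tool here returns exact, lossless results.

```python
import re


class ToolAugmentedRetriever:
    """Retrieval over memory entries via structured tools that can
    complement or replace embedding-based similarity search."""

    def __init__(self, entries):
        # entries: list of (step, text) pairs from the interaction log
        self.entries = sorted(entries)

    def exact_lookup(self, pattern):
        """Keyword/regex match over the log -- exact, not approximate."""
        return [(s, t) for s, t in self.entries if re.search(pattern, t)]

    def temporal_range(self, start, end):
        """All interactions between two steps, preserving order."""
        return [(s, t) for s, t in self.entries if start <= s <= end]

    def last_before(self, step, pattern):
        """Most recent matching entry before a given step, or None."""
        hits = [e for e in self.exact_lookup(pattern) if e[0] < step]
        return hits[-1] if hits else None
```

An agent (or an LLM issuing tool calls) can compose these primitives, e.g. find the last test run before an edit, then pull the surrounding window of steps.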

Results

Benchmark                    Best Baseline   AMA-Agent   Delta
AMA-Bench average accuracy   ~46.06%         57.22%      +11.16 pp

Key Takeaways

  • Dialogue-centric memory benchmarks are insufficient for evaluating agents in real deployments — practitioners should evaluate memory systems on agent-environment interaction traces, not just conversation histories
  • Similarity-based (embedding/vector) retrieval is a significant bottleneck for long-horizon agent memory; incorporating structured retrieval tools and causal graph representations can yield substantial accuracy gains
  • When building memory systems for long-horizon agents, explicitly modeling causality (which actions caused which outcomes) is critical, as purely content-based representations lose the temporal and causal structure needed for accurate retrospective reasoning

Abstract

Large Language Models (LLMs) are deployed as autonomous agents in increasingly complex applications, where enabling long-horizon memory is critical for achieving strong performance. However, a significant gap exists between practical applications and current evaluation standards for agent memory: existing benchmarks primarily focus on dialogue-centric, human-agent interactions. In reality, agent memory consists of a continuous stream of agent-environment interactions that are primarily composed of machine-generated representations. To bridge this gap, we introduce AMA-Bench (Agent Memory with Any length), which evaluates long-horizon memory for LLMs in real agentic applications. It features two key components: (1) a set of real-world agentic trajectories across representative agentic applications, paired with expert-curated QA, and (2) a set of synthetic agentic trajectories that scale to arbitrary horizons, paired with rule-based QA. Our comprehensive study shows that existing memory systems underperform on AMA-Bench primarily because they lack causality and objective information and are constrained by the lossy nature of similarity-based retrieval employed by many memory systems. To address these limitations, we propose AMA-Agent, an effective memory system featuring a causality graph and tool-augmented retrieval. Our results demonstrate that AMA-Agent achieves 57.22% average accuracy on AMA-Bench, surpassing the strongest memory system baselines by 11.16%.

Generated on 2026-03-11 using Claude