
PaperArena: An Evaluation Benchmark for Tool-Augmented Agentic Reasoning on Scientific Literature

Daoyu Wang, Mingyue Cheng, Qi Liu, Shuo Yu, Zirui Liu, Ze Guo
arXiv.org | 2025
PaperArena is a benchmark designed to evaluate LLM-based agents on complex scientific reasoning tasks that require integrating information across multiple papers using external tools. It exposes significant limitations in current state-of-the-art agents on cross-paper, multi-tool reasoning challenges.

Problem Statement

Existing benchmarks for scientific literature understanding are limited to single-paper, tool-free tasks, which fail to capture the complexity of real-world research workflows. There is a critical lack of evaluation infrastructure for cross-paper reasoning and multi-tool orchestration, leaving a gap between benchmark performance and authentic research capabilities. This prevents the community from accurately assessing or improving LLM agents for scientific discovery.

Key Novelty

  • Introduction of PaperArena, the first benchmark specifically targeting cross-paper reasoning with multi-tool orchestration in authentic scientific research scenarios
  • A modular tool environment platform supporting multimodal parsing, context retrieval, and programmatic computation for standardized agent evaluation
  • Detailed analysis of reasoning traces and agent behavior diagnostics, providing actionable insights for developing more capable scientific agents

Evaluation Highlights

  • Leading LLM-powered agents achieve only 38.78% average accuracy on the full benchmark, revealing substantial room for improvement
  • On the hard subset, accuracy drops to 18.47%, highlighting that even state-of-the-art agentic workflows struggle with complex multi-paper reasoning tasks

Breakthrough Assessment

6/10. PaperArena is a solid and timely contribution that fills a clear gap in scientific LLM evaluation, but as a benchmark paper rather than a modeling advance, its impact is enabling rather than paradigm-shifting. It establishes a rigorous new evaluation frontier for agentic scientific reasoning.

Methodology

  1. Curate research questions requiring integration of information across multiple scientific papers, categorized by difficulty, to form the benchmark dataset
  2. Deploy a standardized agent execution platform with modular tools (multimodal parsing, context retrieval, programmatic computation) that enables reproducible multi-tool agentic workflows (a minimal orchestration sketch follows this list)
  3. Evaluate leading LLM-based agents on the benchmark, analyze reasoning traces, and diagnose failure modes to provide community insights
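
Step 2's multi-tool workflow can be pictured as a plan-act-observe loop in which the agent invokes registered tools and accumulates observations before answering. The sketch below uses stub tools and a fixed plan; the tool names and the `run_agent` helper are illustrative assumptions, not PaperArena's released interface.

```python
# Minimal sketch of a plan-act-observe loop for tool-augmented question answering.
# Tool names, stubs, and run_agent are illustrative assumptions, not PaperArena's API.
from typing import Callable, Dict, List, Tuple


def parse_paper(paper_id: str) -> str:     # stub for the multimodal parsing tool
    return f"parsed text, tables, and figures of {paper_id}"


def retrieve(query: str) -> str:           # stub for cross-paper context retrieval
    return f"passages relevant to '{query}'"


def compute(expression: str) -> str:       # stub for programmatic computation
    return f"result of evaluating '{expression}'"


TOOLS: Dict[str, Callable[[str], str]] = {
    "parse_paper": parse_paper,
    "retrieve": retrieve,
    "compute": compute,
}


def run_agent(question: str, plan: List[Tuple[str, str]], max_steps: int = 8) -> str:
    """Execute a fixed reasoning plan given as (tool_name, argument) steps."""
    observations = []
    for tool_name, arg in plan[:max_steps]:
        observations.append(TOOLS[tool_name](arg))
    # A real agent would let the LLM choose each step, then synthesize a
    # grounded answer from the accumulated observations.
    return f"answer to '{question}' grounded in: " + " | ".join(observations)


print(run_agent("Which method reports the higher F1?",
                [("retrieve", "reported F1 scores"), ("compute", "0.83 - 0.79")]))
```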

System Components

Benchmark Dataset

A collection of research questions requiring cross-paper reasoning, including a hard subset, designed to reflect authentic scientific inquiry scenarios
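
One way to picture such an item is as a small record tying a question to the papers, tools, and gold answer it needs. The schema below is an assumed illustration; the paper's released data format may differ.

```python
# Assumed schema for a cross-paper benchmark item; field names are illustrative,
# not taken from the PaperArena release.
from dataclasses import dataclass, field
from typing import List


@dataclass
class BenchmarkItem:
    question: str
    paper_ids: List[str]                       # papers that must be consulted together
    difficulty: str                            # e.g. "easy" or "hard" subset
    required_tools: List[str] = field(default_factory=list)
    answer: str = ""                           # gold answer used for accuracy scoring


item = BenchmarkItem(
    question="Does method A outperform method B on the shared benchmark?",
    paper_ids=["paper_A", "paper_B"],
    difficulty="hard",
    required_tools=["retrieve", "compute"],
    answer="Yes",
)
```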

Multimodal Parsing Tool

Extracts and interprets text, figures, tables, and equations from scientific papers to support comprehensive information retrieval
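
As a rough sense of the interface such a tool exposes, the sketch below extracts per-page text with the pypdf package; genuine multimodal parsing additionally has to recover figures, tables, and equations, which this sketch does not attempt.

```python
# Text-only parsing sketch; the real tool also handles figures, tables, equations.
# Assumes the pypdf package; the returned dict structure is illustrative.
from pypdf import PdfReader


def parse_paper(pdf_path: str) -> dict:
    """Return per-page text for one paper."""
    reader = PdfReader(pdf_path)
    return {
        "num_pages": len(reader.pages),
        "pages": [page.extract_text() or "" for page in reader.pages],
    }
```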

Context Retrieval Tool

Enables agents to search and retrieve relevant passages across multiple papers to ground their reasoning
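
The retriever behind this tool is not detailed here; the sketch below uses TF-IDF over a pooled passage list purely to illustrate the query-in, passages-out interface.

```python
# Minimal retrieval sketch over passages pooled from several papers.
# TF-IDF stands in for whatever retriever the platform actually uses.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

passages = [
    "Paper A reports 0.83 F1 on the shared benchmark.",
    "Paper B reports 0.79 F1 under the same protocol.",
    "Paper C studies an unrelated ablation.",
]


def retrieve(query: str, k: int = 2) -> list:
    """Return the k passages most similar to the query."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(passages + [query])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    top = scores.argsort()[::-1][:k]
    return [passages[i] for i in top]


print(retrieve("Which paper has the higher F1?"))
```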

Programmatic Computation Tool

Allows agents to perform calculations, statistical analyses, or other computational tasks needed to answer quantitative research questions
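
A computation tool has to run agent-supplied expressions without exposing arbitrary code execution. The sketch below restricts evaluation to pure arithmetic via Python's ast module; it is an assumed illustration, not the benchmark's actual sandbox.

```python
# Sketch of a computation tool restricted to arithmetic expressions.
# A real tool would likely execute code in an isolated sandbox instead.
import ast
import operator

_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}


def compute(expression: str) -> float:
    """Safely evaluate a pure-arithmetic expression string."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp):
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp):
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression: {ast.dump(node)}")
    return _eval(ast.parse(expression, mode="eval"))


print(compute("(38.78 - 18.47) / 38.78"))  # relative accuracy drop on the hard subset
```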

Agent Execution Platform

A modular infrastructure that standardizes how agents interact with tools and papers, ensuring reproducible and comparable evaluation across different LLM agents
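
Conceptually, such a platform pairs a shared tool registry with a uniform scoring loop so that different agents are measured under identical conditions. The harness below is a minimal sketch under that assumption; the class and function names are not taken from the paper.

```python
# Sketch of a standardized execution harness: every agent sees the same tool
# registry and the same scoring, so results stay comparable. Names are assumptions.
from typing import Callable, Dict, List


class ToolRegistry:
    """Registers tools by name so any agent invokes them through one interface."""
    def __init__(self) -> None:
        self._tools: Dict[str, Callable[[str], str]] = {}

    def register(self, name: str, fn: Callable[[str], str]) -> None:
        self._tools[name] = fn

    def call(self, name: str, arg: str) -> str:
        return self._tools[name](arg)


def evaluate(agent: Callable[[str, ToolRegistry], str],
             items: List[dict],
             registry: ToolRegistry) -> float:
    """Run one agent over all benchmark items and return exact-match accuracy."""
    correct = sum(
        agent(item["question"], registry).strip() == item["answer"]
        for item in items
    )
    return correct / max(len(items), 1)


registry = ToolRegistry()
registry.register("echo", lambda arg: arg)
print(evaluate(lambda q, reg: reg.call("echo", "Yes"),
               [{"question": "Does A beat B?", "answer": "Yes"}], registry))
```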

Results

Metric                           Human / Oracle Baseline        Best LLM Agent   Gap (points)
Average Accuracy (Full Set)      ~100% (assumed human expert)   38.78%           -61.22
Average Accuracy (Hard Subset)   ~100% (assumed human expert)   18.47%           -81.53

Key Takeaways

  • Current LLM agents are far from solving real-world scientific reasoning tasks — practitioners should not assume strong benchmark performance on single-paper tasks generalizes to multi-paper, multi-tool settings
  • Building robust scientific agents requires explicit support for tool orchestration (parsing, retrieval, computation) and cross-document reasoning, areas where current architectures are clearly deficient
  • PaperArena provides a ready-to-use evaluation platform and diagnostic framework that ML practitioners can leverage to benchmark and iteratively improve their scientific agentic systems
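
For instance, a practitioner could tally failure modes from logged reasoning traces to decide what to fix first; the trace fields and failure labels in the sketch below are hypothetical.

```python
# Sketch of a trace diagnostic: tally failure modes from logged reasoning traces.
# The trace format and failure labels are assumptions for illustration only.
from collections import Counter

traces = [
    {"question_id": 1, "correct": False, "failure": "wrong tool selected"},
    {"question_id": 2, "correct": True,  "failure": None},
    {"question_id": 3, "correct": False, "failure": "missed cross-paper link"},
]

failure_counts = Counter(t["failure"] for t in traces if not t["correct"])
print(failure_counts.most_common())
```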

Abstract

Understanding and reasoning on the large-scale scientific literature is a crucial touchstone for large language model (LLM) based agents. However, existing works are mainly restricted to tool-free tasks within single papers, largely due to the lack of a benchmark that evaluates cross-paper reasoning and multi-tool orchestration in authentic research scenarios. In this work, we propose PaperArena, a benchmark to evaluate LLM-based agents on questions that require integrating information across multiple papers with the assistance of external tools. Given a research question, agents should formulate a reasoning plan, interact with multiple papers, and invoke appropriate tools to produce a well-grounded answer. To support standardized evaluation, we provide a platform for agent execution, offering a modular tool environment including multimodal parsing, context retrieval, and programmatic computation. Experiments reveal that even the leading LLM powering a well-established agentic workflow achieves merely 38.78% average accuracy, while on the hard subset, accuracy drops to only 18.47%. We also analyze reasoning traces and diagnose agent behavior, providing the community with insights to develop and evaluate more capable scientific agents.
