Towards Reliable Benchmarking: A Contamination Free, Controllable Evaluation Framework for Multi-step LLM Function Calling

Seiji Maekawa, Jackson Hassell, Pouya Pezeshkpour, Tom Mitchell, Estevam Hruschka
arXiv.org | 2025
FuncBenchGen is a contamination-free, controllable benchmark generation framework that evaluates tool-augmented language models (TaLMs) by casting multi-step function calling as traversal over a hidden function-dependency DAG, enabling systematic stress-testing with precise difficulty control.

Problem Statement

Existing TaLM benchmarks are susceptible to data contamination (training-data leakage) and lack fine-grained control over task-difficulty dimensions such as dependency depth, graph size, and distractor functions. This makes it difficult to isolate failure modes or to compare models reliably, and static benchmarks become obsolete as models are trained on their contents. There is also no principled way to diagnose why models fail at multi-step tool use.

Key Novelty

  • FuncBenchGen: a unified synthetic benchmark generation framework that models tool use as DAG traversal, eliminating contamination by generating novel tasks at evaluation time
  • Controllable difficulty axes (graph size, dependency depth, distractor function types including type-compatible 'connected distractors') that allow precise ablation of model capabilities
  • Identification of brittle state tracking as a key failure mode and a lightweight mitigation—explicitly restating prior variable values at each step—that yields large performance gains (e.g., 62.5% → 81.3% for GPT-5)

Evaluation Highlights

  • Reasoning-optimized models consistently outperform general-purpose models; GPT-5 achieves an 81.3% success rate with the variable-restatement mitigation vs. a 62.5% baseline, an 18.8-percentage-point improvement
  • Performance degrades sharply with increasing dependency depth, and 'connected distractors' (type-compatible irrelevant functions) are disproportionately difficult, revealing that models conflate structurally similar but semantically irrelevant functions

Breakthrough Assessment

6/10. The paper makes a solid, practically valuable contribution by providing a rigorous, contamination-free evaluation methodology and diagnosing a specific, actionable failure mode (state tracking) with a surprisingly effective fix. However, the core ideas (synthetic benchmark generation and DAG-based task formulation) are incremental extensions of existing evaluation paradigms rather than a paradigm shift.

Methodology

  1. Construct a function-dependency DAG in which each node is a callable function and edges encode data dependencies; the model's goal is to infer and traverse the correct call sequence to compute a target output value
  2. Parameterize task difficulty along multiple axes (DAG depth, breadth, and the presence of distractor functions, including type-compatible 'connected distractors') and generate tasks programmatically at evaluation time so they cannot overlap with pretraining data; a minimal generator sketch follows this list
  3. Evaluate TaLMs on the generated tasks, analyze failure modes such as syntactically valid calls that propagate semantically incorrect arguments, and apply the mitigation of explicitly restating previously computed variable values to the model at each reasoning step
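
The generation pipeline can be pictured as a small program. The sketch below, in Python, builds a layered function-dependency DAG with the stated difficulty knobs (depth, breadth, distractor count) and a fresh random seed per run; the names (FuncNode, TaskConfig, make_task) and the integer-valued toy functions are illustrative assumptions, not FuncBenchGen's actual API.

```python
import random
from dataclasses import dataclass


@dataclass
class FuncNode:
    name: str        # tool name exposed to the model
    inputs: list     # names of upstream functions whose outputs this one consumes
    op: object       # hidden implementation, used only to compute ground truth


@dataclass
class TaskConfig:
    depth: int = 3          # length of the longest dependency chain
    breadth: int = 2        # number of functions per layer
    n_distractors: int = 2  # irrelevant functions mixed into the tool list
    seed: int = 0           # a fresh seed per evaluation run keeps tasks novel


def make_task(cfg: TaskConfig):
    """Build a layered function-dependency DAG plus disconnected distractors."""
    rng = random.Random(cfg.seed)
    layers, nodes = [], {}
    for d in range(cfg.depth):
        layer = []
        for b in range(cfg.breadth):
            name = f"f_{d}_{b}"
            # Layer-0 functions are leaves; deeper ones consume the outputs of
            # one or two functions drawn from the previous layer.
            parents = [] if d == 0 else rng.sample(layers[d - 1], k=min(2, cfg.breadth))
            k = rng.randint(1, 9)
            op = (lambda k=k: k) if not parents else (lambda *args, k=k: sum(args) + k)
            nodes[name] = FuncNode(name, parents, op)
            layer.append(name)
        layers.append(layer)
    for i in range(cfg.n_distractors):
        # Plain distractors; type-compatible "connected" distractors are
        # sketched separately under System Components below.
        nodes[f"d_{i}"] = FuncNode(f"d_{i}", [], lambda k=rng.randint(1, 9): k)
    target = layers[-1][0]  # the model must compute this node's output value
    return nodes, target


nodes, target = make_task(TaskConfig(depth=4, breadth=2, n_distractors=3, seed=123))
print(target, sorted(nodes))
```

Because each task is instantiated from a seed at evaluation time rather than drawn from a fixed dataset, there is nothing for a model to have memorized during pretraining, which is the contamination-free property the framework relies on.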

System Components

Function-Dependency DAG

A directed acyclic graph where nodes represent callable functions and edges encode input-output dependencies; models must infer the correct traversal order to compute a target value
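
To make the traversal framing concrete, the sketch below hand-builds a tiny three-function graph and recovers both the correct call order (a topological order over the DAG) and the ground-truth target value. The tool names and the use of Python's standard-library graphlib are illustrative assumptions, not the paper's implementation.

```python
# A toy function-dependency DAG: node -> (parent nodes, hidden implementation).
from graphlib import TopologicalSorter

dag = {
    "get_user_id": ([], lambda: 42),
    "get_account": (["get_user_id"], lambda uid: {"id": uid, "plan": "pro"}),
    "get_invoice": (["get_account"], lambda acct: 19.99 if acct["plan"] == "pro" else 0.0),
}
target = "get_invoice"

# TopologicalSorter takes node -> predecessors, which is exactly our edge map;
# static_order() yields each node only after all of its parents.
order = list(TopologicalSorter({n: set(ps) for n, (ps, _) in dag.items()}).static_order())

values = {}
for name in order:
    parents, fn = dag[name]
    values[name] = fn(*(values[p] for p in parents))  # feed cached parent outputs

print(order)           # ['get_user_id', 'get_account', 'get_invoice']
print(values[target])  # 19.99
```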

Difficulty Controller

Parameterizes task generation along axes including graph size, dependency depth, and distractor function count/type, enabling systematic difficulty scaling

Connected Distractors

Irrelevant functions that share type-compatible variables with task-relevant functions, designed to probe models' ability to avoid plausible-but-wrong tool calls
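
As an illustration only, a connected distractor can be built by cloning a relevant tool's type signature under a different but plausible-sounding name; the dictionary schema below loosely imitates common JSON tool specifications and is not the paper's format.

```python
# Sketch: derive a type-compatible ("connected") distractor from a relevant tool.
import copy
import random

relevant_tool = {
    "name": "get_invoice_total",
    "parameters": {"account_id": "string", "month": "string"},
    "returns": "number",
}

def make_connected_distractor(tool: dict, rng: random.Random) -> dict:
    """Clone a relevant tool's type signature under a new, plausible name."""
    distractor = copy.deepcopy(tool)
    distractor["name"] = f"get_{rng.choice(['refund', 'credit', 'quota'])}_total"
    # Same parameter names/types and return type: only the semantics differ,
    # so type checking alone cannot rule the distractor out.
    return distractor

print(make_connected_distractor(relevant_tool, random.Random(0)))
```

Because the distractor plugs into the same variables as the real tools, avoiding it requires reasoning about what the task needs rather than about what merely type-checks.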

Contamination-Free Generator

Synthetic task generation pipeline that produces novel function-call chains at evaluation time, preventing pretraining/test-time data leakage

Variable Restatement Mitigation

A prompting strategy that explicitly re-presents previously computed variable values to the model at each step, addressing brittle state tracking in multi-turn tool use
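
A minimal sketch of how this might be wired into an agent loop is shown below; the message format and helper names (restate_state, next_turn_messages) are assumptions for illustration rather than the paper's prompts.

```python
# Sketch of the variable-restatement mitigation: before every new tool-calling
# turn, restate all previously computed values so the model does not have to
# recover them from earlier turns in the conversation.

def restate_state(computed: dict) -> str:
    """Render the current variable bindings as an explicit reminder."""
    if not computed:
        return "No values have been computed yet."
    lines = [f"- {name} = {value!r}" for name, value in computed.items()]
    return "Values computed so far:\n" + "\n".join(lines)


def next_turn_messages(history: list, computed: dict) -> list:
    """Append the restated state so each turn sees fresh, explicit bindings."""
    reminder = {"role": "user", "content": restate_state(computed)}
    return history + [reminder]


# Example: after two successful calls, the next prompt restates both values.
print(restate_state({"user_id": 42, "account_plan": "pro"}))
```

The appeal of the pattern is its cost: it adds a few lines of prompt text per turn, yet in the paper's evaluation it lifted GPT-5's success rate from 62.5% to 81.3%.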

Results

| Metric/Condition | Baseline (no mitigation) | With Variable Restatement | Delta |
| --- | --- | --- | --- |
| GPT-5 success rate | 62.5% | 81.3% | +18.8 pp |
| Performance vs. dependency depth | High at shallow depth | Sharp decline at greater depth | Depth is a key failure axis |
| Reasoning-optimized vs. general-purpose | General-purpose lower | Reasoning-optimized consistently higher | Architecture matters for multi-step calling |
| Connected distractor handling | Strong degradation | Partially improved with restatement | Remains a challenging failure mode |

Key Takeaways

  • Explicitly restating prior computed variable values in the prompt at each step is a cheap, high-impact mitigation for multi-step function calling failures—practitioners should adopt this pattern when building agentic pipelines
  • Benchmark contamination is a serious concern for TaLM evaluation; synthetic, programmatically generated benchmarks like FuncBenchGen should be preferred over static datasets when assessing tool-use capabilities
  • Connected distractors (type-compatible irrelevant functions) are a particularly difficult challenge and should be included in any rigorous evaluation of LLM tool selection, as they expose subtle reasoning failures that simpler benchmarks miss

Abstract

Existing benchmarks for tool-augmented language models (TaLMs) lack fine-grained control over task difficulty and remain vulnerable to data contamination. We present FuncBenchGen, a unified, contamination-free framework that evaluates TaLMs by generating synthetic multi-step tool-use tasks to stress-test them. The key idea is to cast tool use as traversal over a hidden function-dependency DAG, where models must infer the correct sequence of calls to compute a target value. FuncBenchGen allows precise control over task difficulty (e.g., graph size, dependency depth, and distractor functions) while avoiding pretraining/test-time leakage. Our evaluation demonstrates that reasoning-optimized models consistently outperform general-purpose models, with GPT-5 significantly outperforming other available models. Performance declines sharply as dependency depth increases. Furthermore, connected distractors -- irrelevant functions sharing type-compatible variables with relevant functions -- prove especially difficult to handle. Also, strong models often make syntactically valid function calls but propagate incorrect or stale argument values across steps, revealing brittle state tracking by LLMs in multi-turn tool use. Motivated by this observation, we introduce a simple mitigation strategy that explicitly restates prior variable values to the agent at each step. Surprisingly, this lightweight change yields substantial gains across models, e.g., yielding an improvement in success rate from 62.5% to 81.3% for GPT-5.

Generated on 2026-02-21 using Claude