Towards Reliable Benchmarking: A Contamination-Free, Controllable Evaluation Framework for Multi-step LLM Function Calling
Problem Statement
Existing benchmarks for tool-augmented language models (TaLMs) are susceptible to data contamination (training-data leakage) and lack fine-grained control over task-difficulty dimensions such as dependency depth, graph size, and distractor functions. This makes it hard to isolate failure modes or to compare models reliably, and static benchmarks become obsolete as models are trained on their contents. There is also no principled way to diagnose why models fail at multi-step tool use.
Key Novelty
- FuncBenchGen: a unified synthetic benchmark generation framework that models tool use as DAG traversal, eliminating contamination by generating novel tasks at evaluation time (a minimal data-structure sketch follows this list)
- Controllable difficulty axes (graph size, dependency depth, distractor function types including type-compatible 'connected distractors') that allow precise ablation of model capabilities
- Identification of brittle state tracking as a key failure mode and a lightweight mitigation—explicitly restating prior variable values at each step—that yields large performance gains (e.g., 62.5% → 81.3% for GPT-5)
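To make the DAG-traversal framing concrete, here is a minimal, self-contained sketch of how such a task can be represented; the class and field names (`FuncNode`, `DAGTask`) are our own illustration, not the paper's code. The model would see only the function signatures and the target variable; the dependency edges stay hidden and are used solely to check a submitted call sequence.

```python
from dataclasses import dataclass

@dataclass
class FuncNode:
    """One callable exposed to the model: a name, typed inputs, and a single typed output."""
    name: str
    inputs: dict[str, str]   # input variable name -> type
    output: tuple[str, str]  # (output variable name, type)

@dataclass
class DAGTask:
    """Dependency edges are implicit: g depends on f when g consumes the variable f produces.
    The model only sees visible_signatures() and the target variable name."""
    functions: list[FuncNode]
    target_variable: str

    def visible_signatures(self) -> list[str]:
        sigs = []
        for f in self.functions:
            params = ", ".join(f"{p}: {t}" for p, t in f.inputs.items())
            out_name, out_type = f.output
            sigs.append(f"{f.name}({params}) -> {out_name}: {out_type}")
        return sigs

    def check(self, call_sequence: list[str]) -> bool:
        """Success iff every call's inputs were already produced and the target value gets computed."""
        by_name = {f.name: f for f in self.functions}
        available: set[str] = set()
        for name in call_sequence:
            f = by_name.get(name)
            if f is None or not set(f.inputs) <= available:
                return False  # unknown function or a dependency not yet computed
            available.add(f.output[0])
        return self.target_variable in available
```

Encoding dependencies implicitly through shared variable names keeps the graph hidden from the model, which must infer from the signatures alone which outputs feed which inputs.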
Evaluation Highlights
- Reasoning-optimized models consistently outperform general-purpose models; GPT-5 achieves an 81.3% success rate with the variable-restatement mitigation vs. 62.5% baseline, an 18.8-percentage-point improvement
- Performance degrades sharply with increasing dependency depth, and 'connected distractors' (type-compatible irrelevant functions) are disproportionately difficult, revealing that models conflate structurally similar but semantically irrelevant functions
Breakthrough Assessment
Methodology
- Step 1: Construct a function-dependency DAG where each node is a callable function and edges encode data dependencies; the model's goal is to infer and traverse the correct call sequence to compute a target output value
- Step 2: Parameterize task difficulty along multiple axes (DAG depth, breadth, and the presence of distractor functions, including type-compatible 'connected distractors') and generate synthetic tasks programmatically to avoid any overlap with pretraining data (see the generation sketch after this list)
- Step 3: Evaluate TaLMs on the generated tasks, analyze failure modes (syntactically valid calls that propagate incorrect or stale argument values), and apply the mitigation strategy of explicitly restating previously computed variable values to the model at each reasoning step
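The generation step can be pictured with a short sketch like the one below (our own simplification; `generate_task` and its parameters `depth`, `breadth`, and `n_connected_distractors` are illustrative names, not the framework's actual API). It builds a ground-truth dependency chain, optionally widens it, and attaches connected distractors that consume type-compatible chain variables but contribute nothing to the target.

```python
import random

def generate_task(depth: int, breadth: int, n_connected_distractors: int, seed: int = 0) -> dict:
    """Sketch of programmatic task generation with controllable difficulty knobs.

    depth: length of the ground-truth dependency chain leading to the target.
    breadth: extra independent producers whose outputs the final chain call also requires.
    n_connected_distractors: irrelevant functions that consume type-compatible chain variables.
    """
    assert depth >= 1
    rng = random.Random(seed)
    functions: list[dict] = []
    gold_order: list[str] = []
    # Core chain: f_1 produces v_1; f_i consumes v_{i-1} and produces v_i; v_depth is the target.
    for i in range(1, depth + 1):
        inputs = {f"v_{i - 1}": "int"} if i > 1 else {}
        functions.append({"name": f"f_{i}", "inputs": inputs, "output": (f"v_{i}", "int")})
        gold_order.append(f"f_{i}")
    # Breadth: independent producers wired into the final chain function.
    for j in range(breadth):
        functions.append({"name": f"g_{j}", "inputs": {}, "output": (f"w_{j}", "int")})
        functions[depth - 1]["inputs"][f"w_{j}"] = "int"
        gold_order.insert(0, f"g_{j}")
    # Connected distractors: type-compatible with a chain variable, but their outputs are never needed.
    for k in range(n_connected_distractors):
        tapped = rng.randrange(1, depth + 1)
        functions.append({"name": f"d_{k}", "inputs": {f"v_{tapped}": "int"},
                          "output": (f"u_{k}", "int")})
    rng.shuffle(functions)
    return {"functions": functions, "target": f"v_{depth}", "gold_call_order": gold_order}
```

For example, `generate_task(depth=6, breadth=2, n_connected_distractors=4)` yields a task whose shortest solution is an eight-call sequence hidden among four plausible but irrelevant functions; because tasks are sampled fresh at evaluation time, none of them can have appeared in pretraining data.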
System Components
- Function-dependency DAG: a directed acyclic graph where nodes represent callable functions and edges encode input-output dependencies; models must infer the correct traversal order to compute a target value
- Difficulty parameterization: controls task generation along axes including graph size, dependency depth, and distractor function count/type, enabling systematic difficulty scaling
- Connected distractors: irrelevant functions that share type-compatible variables with task-relevant functions, designed to probe models' ability to avoid plausible-but-wrong tool calls
- Contamination-free task generator: a synthetic generation pipeline that produces novel function-call chains at evaluation time, preventing pretraining/test-time data leakage
- Variable restatement: a prompting strategy that explicitly re-presents previously computed variable values to the model at each step, addressing brittle state tracking in multi-turn tool use (see the sketch after this list)
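A minimal sketch of how the variable-restatement component could be realized (our own illustration; the helper name `restate_known_variables` and the message wording are assumptions, not the paper's prompt):

```python
def restate_known_variables(executed_calls: list[tuple[str, str, object]]) -> str:
    """Build a per-step reminder of every variable value computed so far.

    executed_calls: (function_name, output_variable, value) tuples in call order.
    """
    if not executed_calls:
        return "No variables have been computed yet."
    lines = ["Values computed so far (use these exact values as arguments):"]
    for fn_name, var, value in executed_calls:
        lines.append(f"- {var} = {value!r}  (returned by {fn_name})")
    return "\n".join(lines)

# Example: after two calls, the reminder prepended to the next turn would read:
#   Values computed so far (use these exact values as arguments):
#   - user_id = 42  (returned by get_user_id)
#   - balance = 17.5  (returned by get_balance)
```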
Results
| Finding | Observation | Takeaway |
|---|---|---|
| Variable restatement (GPT-5) | Success rate rises from 62.5% without the mitigation to 81.3% with it | +18.8 pp from a lightweight prompt change |
| Dependency depth | Success is high at shallow depth and declines sharply as depth grows | Depth is a key failure axis |
| Reasoning-optimized vs. general-purpose models | Reasoning-optimized models are consistently stronger | Model class matters for multi-step calling |
| Connected distractors | Strong degradation, only partially improved by restatement | Remains a challenging failure mode |
Key Takeaways
- Explicitly restating prior computed variable values in the prompt at each step is a cheap, high-impact mitigation for multi-step function-calling failures; practitioners should adopt this pattern when building agentic pipelines (a wiring sketch follows this list)
- Benchmark contamination is a serious concern for TaLM evaluation; synthetic, programmatically generated benchmarks like FuncBenchGen should be preferred over static datasets when assessing tool-use capabilities
- Connected distractors (type-compatible irrelevant functions) are a particularly difficult challenge and should be included in any rigorous evaluation of LLM tool selection, as they expose subtle reasoning failures that simpler benchmarks miss
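As a practitioner-facing sketch of the first takeaway, the loop below shows where such a restatement message could be injected into a generic multi-turn tool-calling pipeline. `run_tool_agent`, `call_model`, and `execute_tool` are placeholders for whatever LLM client and tool runtime a pipeline already uses; this is not the paper's implementation, only an illustration of the pattern under those assumptions.

```python
from typing import Any, Callable

def run_tool_agent(
    call_model: Callable[[list[dict]], dict],  # placeholder LLM client: returns {"name", "args"} or {"final"}
    execute_tool: Callable[[str, dict], Any],  # placeholder tool runtime
    task_prompt: str,
    max_steps: int = 16,
) -> Any:
    """Generic multi-turn tool-calling loop that restates all computed values before every model call."""
    messages: list[dict] = [{"role": "user", "content": task_prompt}]
    computed: dict[str, Any] = {}  # result name -> value

    for _ in range(max_steps):
        if computed:
            # The mitigation: re-present every previously computed value so the model
            # does not have to recover stale state from earlier turns on its own.
            reminder = "Known values so far:\n" + "\n".join(
                f"- {name} = {value!r}" for name, value in computed.items()
            )
            messages.append({"role": "user", "content": reminder})

        decision = call_model(messages)
        if "final" in decision:  # model signals it has the target value
            return decision["final"]

        result = execute_tool(decision["name"], decision["args"])
        computed[f"{decision['name']}_result"] = result
        messages.append(
            {"role": "assistant", "content": f"Called {decision['name']}({decision['args']}) -> {result!r}"}
        )

    raise RuntimeError("Step budget exhausted without producing the target value")
```

The only change relative to a standard tool-calling loop is the reminder message appended before each model call, which is what makes the mitigation cheap to adopt.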
Abstract
Existing benchmarks for tool-augmented language models (TaLMs) lack fine-grained control over task difficulty and remain vulnerable to data contamination. We present FuncBenchGen, a unified, contamination-free framework that stress-tests TaLMs on synthetically generated multi-step tool-use tasks. The key idea is to cast tool use as traversal over a hidden function-dependency DAG, where models must infer the correct sequence of calls to compute a target value. FuncBenchGen allows precise control over task difficulty (e.g., graph size, dependency depth, and distractor functions) while avoiding pretraining/test-time leakage. Our evaluation demonstrates that reasoning-optimized models consistently outperform general-purpose models, with GPT-5 significantly outperforming other available models. Performance declines sharply as dependency depth increases. Furthermore, connected distractors (irrelevant functions sharing type-compatible variables with relevant functions) prove especially difficult to handle. Moreover, strong models often make syntactically valid function calls but propagate incorrect or stale argument values across steps, revealing brittle state tracking by LLMs in multi-turn tool use. Motivated by this observation, we introduce a simple mitigation strategy that explicitly restates prior variable values to the agent at each step. Surprisingly, this lightweight change yields substantial gains across models, e.g., improving GPT-5's success rate from 62.5% to 81.3%.