ToolScope: Enhancing LLM Agent Tool Use through Tool Merging and Context-Aware Filtering
Problem Statement
LLM agents operating over large, real-world toolsets suffer from ambiguity caused by redundant tools with overlapping names and descriptions, degrading selection accuracy. Additionally, LLMs face strict context window limits that prevent them from considering large toolsets in a single pass, forcing trade-offs between coverage and efficiency. Existing approaches lack automated mechanisms to both consolidate redundant tools and dynamically compress toolsets to fit within context constraints.
Key Novelty
- ToolScopeMerger with Auto-Correction: an automated pipeline that audits tool definitions, identifies redundant/overlapping tools, merges them, and self-corrects merge errors to reduce ambiguity
- ToolScopeRetriever: a context-aware ranking and filtering module that selects only the most query-relevant tools, compressing toolsets to fit within LLM context limits without sacrificing selection accuracy
- Combined end-to-end system evaluated across three LLMs and three open-source tool-use benchmarks showing substantial accuracy gains (8.38%–38.6%)
Evaluation Highlights
- Tool selection accuracy improved by 8.38% to 38.6% over baselines across three state-of-the-art LLMs and three open-source tool-use benchmarks
- Consistent gains across diverse LLM backbones and benchmark datasets, demonstrating robustness of the approach beyond a single model or task setting
Breakthrough Assessment
Methodology
- Step 1 – Tool Auditing and Merging: ToolScopeMerger analyzes the full toolset to detect redundant or semantically overlapping tools, merges them into unified tool definitions, and applies an Auto-Correction mechanism to validate and fix incorrect merges
- Step 2 – Context-Aware Retrieval and Filtering: ToolScopeRetriever encodes incoming queries and tool descriptions, ranks tools by relevance, and selects a compressed subset that fits within the LLM's context window for a given query
- Step 3 – Agent Tool Use: The pruned, deduplicated toolset is passed to the LLM agent for tool selection and task execution, evaluated against ground-truth tool choices on benchmark datasets
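Step 3 scores the agent against ground-truth tool choices. The sketch below shows one way such a score could be computed once tools have been merged. It assumes that ground-truth labels naming a pre-merge tool are remapped to the merged tool's name before comparison. The source does not describe this remapping, and `selection_accuracy` and `merged_name` are hypothetical names used only for illustration.

```python
def selection_accuracy(predictions, ground_truth, merged_name=None):
    """Fraction of queries where the predicted tool matches the ground-truth tool.

    merged_name is an assumed mapping from an original tool name to the name of
    the merged tool that replaced it; names not in the mapping are left as-is.
    """
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not ground_truth:
        return 0.0
    mapping = merged_name or {}

    def canonical(tool):
        # Map a pre-merge name to its merged name, if it was merged at all.
        return mapping.get(tool, tool)

    correct = sum(
        canonical(pred) == canonical(gold)
        for pred, gold in zip(predictions, ground_truth)
    )
    return correct / len(ground_truth)


if __name__ == "__main__":
    merged = {"get_weather": "weather", "fetch_weather": "weather"}
    preds = ["weather", "send_email", "weather"]
    gold = ["get_weather", "send_email", "convert_currency"]
    print(selection_accuracy(preds, gold, merged))
```

Without the remapping, a correct prediction of a merged tool would be counted as wrong whenever the benchmark label still uses the original name, so an evaluation of a merged toolset would need some step of this kind.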
System Components
ToolScopeMerger: Automatically audits a toolset for redundant tools with overlapping names and descriptions, merges them into consolidated definitions, and applies a self-correction loop to detect and fix erroneous merges, reducing ambiguity in the tool selection space
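The audit, merge, and self-correct flow can be illustrated with a minimal sketch. It is not the paper's implementation: the source does not say how overlap is detected or how merges are validated. Here, overlap is approximated with Jaccard similarity over name and description words, and validation is delegated to a caller-supplied `is_valid_merge` function (which could be an LLM judge or a hand-written rule). The function names and the `threshold` value are assumptions.

```python
from itertools import combinations


def tokens(text):
    """Lowercased word set, used as a crude lexical overlap signal."""
    return set(text.lower().replace("_", " ").split())


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_merge_groups(tools, threshold=0.5):
    """Group tools whose name+description overlap meets the threshold.

    tools maps a tool name to its description. Grouping uses union-find, so
    overlap is transitive: if A~B and B~C, all three land in one group.
    """
    parent = {name: name for name in tools}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    bags = {name: tokens(name + " " + desc) for name, desc in tools.items()}
    for a, b in combinations(tools, 2):
        if jaccard(bags[a], bags[b]) >= threshold:
            parent[find(b)] = find(a)

    groups = {}
    for name in tools:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


def auto_correct(groups, is_valid_merge):
    """Undo any proposed merge the validator rejects (assumed correction rule)."""
    corrected = []
    for group in groups:
        if len(group) == 1 or is_valid_merge(group):
            corrected.append(group)
        else:
            corrected.extend([name] for name in group)
    return sorted(corrected)
```

Splitting a rejected group back into individual tools is the simplest possible correction; a fuller version might try to re-partition the group instead of discarding the merge outright.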
ToolScopeRetriever: A context-aware ranking and filtering module that scores tools by relevance to each incoming query and selects only the top-k most relevant tools, compressing the effective toolset to fit within LLM context limits while preserving the tools needed for accurate task completion
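A minimal sketch of top-k selection under a context budget follows. The source does not specify the encoder or scoring function, so this example substitutes bag-of-words cosine similarity for whatever encoder the system uses, and it approximates token cost with a whitespace word count. `retrieve_tools`, `k`, and `token_budget` are hypothetical names and parameters.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts; a stand-in for a learned encoder."""
    return Counter(text.lower().replace("_", " ").split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_tools(query, tools, k=2, token_budget=50):
    """Return up to k tool names, most relevant first, within the budget.

    tools maps a tool name to its description. A tool whose (approximate)
    token cost would exceed the remaining budget is skipped.
    """
    query_vec = vectorize(query)
    ranked = sorted(
        tools,
        key=lambda name: cosine(query_vec, vectorize(name + " " + tools[name])),
        reverse=True,
    )
    selected, used = [], 0
    for name in ranked:
        if len(selected) == k:
            break
        cost = len((name + " " + tools[name]).split())  # rough token proxy
        if used + cost > token_budget:
            continue
        selected.append(name)
        used += cost
    return selected
```

Only the names returned by `retrieve_tools` would be placed in the agent's prompt, which is how the effective toolset shrinks to fit the context window.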
Results
| Benchmark/Setting | Baseline Accuracy | ToolScope Accuracy | Delta |
|---|---|---|---|
| Best-case benchmark/LLM pair | Not specified | Not specified | +38.6% |
| Worst-case benchmark/LLM pair | Not specified | Not specified | +8.38% |
| Range across 3 LLMs × 3 benchmarks | Not specified | Not specified | +8.38% to +38.6% |
Key Takeaways
- Deduplicating and merging overlapping tools before LLM inference is a high-impact, low-cost preprocessing step that meaningfully reduces ambiguity and improves tool selection accuracy in real-world toolsets
- Context-aware retrieval filtering is essential for scaling LLM agents to large toolsets: selecting only query-relevant tools rather than passing the full toolset can yield large accuracy gains while respecting context window limits
- The Auto-Correction mechanism in ToolScopeMerger highlights the importance of validation loops when using LLMs or automated systems to restructure tool definitions, as naive merging can introduce new errors
Abstract
Large language model (LLM) agents rely on external tools to solve complex tasks, but real-world toolsets often contain redundant tools with overlapping names and descriptions, introducing ambiguity and reducing selection accuracy. LLMs also face strict input context limits, preventing efficient consideration of large toolsets. To address these challenges, we propose ToolScope, which includes: (1) ToolScopeMerger with Auto-Correction to automatically audit and fix tool merges, reducing redundancy, and (2) ToolScopeRetriever to rank and select only the most relevant tools for each query, compressing toolsets to fit within context limits without sacrificing accuracy. Evaluations on three state-of-the-art LLMs and three open-source tool-use benchmarks show gains of 8.38% to 38.6% in tool selection accuracy, demonstrating ToolScope's effectiveness in enhancing LLM tool use.