ToolWeaver: Weaving Collaborative Semantics for Scalable Tool Use in Large Language Models
Problem Statement
Retrieval-based pipelines fail to capture complex tool semantics, and LLMs lack intrinsic tool knowledge from their natural-language pretraining. Generative methods that assign each tool a unique new token suffer from vocabulary explosion (scaling linearly with tool count) and semantic isolation, making it impossible to learn collaborative tool relationships at scale. With libraries approaching tens of thousands of tools, these limitations create a fundamental scalability and generalization crisis.
Key Novelty
- Hierarchical code sequences for tool representation that make vocabulary expansion logarithmic rather than linear in the number of tools
- Novel tokenization process that weaves together intrinsic tool semantics with extrinsic co-usage patterns to generate structured, shared codes
- Generative alignment stage that fine-tunes the LLM to produce hierarchical code sequences, enabling dense co-occurrence learning of collaborative tool relationships instead of sparse monolithic ID co-occurrences
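As a rough illustration of the scaling claim, a shared codebook of b sub-tokens per level needs only ceil(log_b N) levels to give N tools distinct code sequences, so the added vocabulary grows logarithmically. The codebook size of 256 below is an assumed configuration for illustration, not a figure from the paper:

```python
import math

def unique_token_vocab(num_tools: int) -> int:
    # Baseline generative scheme: one new token per tool,
    # so added vocabulary grows linearly with tool count.
    return num_tools

def hierarchical_code_vocab(num_tools: int, codebook_size: int = 256) -> int:
    # Hierarchical scheme: with `codebook_size` shared sub-tokens per level,
    # ceil(log_b(num_tools)) levels suffice to address every tool, so added
    # vocabulary grows logarithmically with tool count.
    levels = math.ceil(math.log(num_tools, codebook_size))
    return levels * codebook_size

print(unique_token_vocab(47_000))       # 47000 new tokens
print(hierarchical_code_vocab(47_000))  # 2 levels x 256 = 512 new tokens
```

At the paper's ~47K-tool scale, two levels of 256 codes already address 256^2 = 65,536 tools, which is the "orders of magnitude reduction" claimed in the Results table.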
Evaluation Highlights
- Evaluated on a benchmark with nearly 47,000 tools, significantly outperforming state-of-the-art retrieval-based and generative tool-selection methods
- Demonstrates superior scalability and generalization compared to prior methods, particularly as tool library size grows large
Methodology
- Step 1 - Structured Code Generation: Apply a novel tokenization process to each tool that encodes both its intrinsic semantic properties (description, function) and extrinsic co-usage patterns (which tools are used together), producing a hierarchical code sequence per tool
- Step 2 - Vocabulary Construction: Build a compact shared codebook where tool codes are composed of reusable sub-tokens, ensuring vocabulary size grows logarithmically with tool count and enabling dense co-occurrence signals across tools sharing code components
- Step 3 - Generative Alignment Fine-tuning: Fine-tune the LLM on (task, hierarchical tool code sequence) pairs so the model learns to directly generate the structured code identifiers for the appropriate tools, unifying tool selection and execution within the LLM
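Step 3 can be sketched as constructing (task, code-sequence) training pairs, where tools sharing a high-level code reuse the same sub-token. The code-token format `<c1_i><c2_j>`, the toy tool library, and the two-level codes below are illustrative assumptions, not ToolWeaver's actual tokenizer output:

```python
# Hypothetical sketch of building fine-tuning targets for generative alignment.
# Each tool carries a two-level hierarchical code (level-1 cluster, level-2 index).

def format_target(tool_codes: list[tuple[int, int]]) -> str:
    # Render each tool's code as shared sub-tokens. Tools with the same
    # level-1 code emit the same <c1_*> token, which is what produces the
    # dense co-occurrence signal during fine-tuning.
    return " ".join(f"<c1_{a}><c2_{b}>" for a, b in tool_codes)

# Toy library: tool name -> hierarchical code (both codes are made up).
tool_to_code = {
    "weather.lookup":   (3, 17),
    "weather.forecast": (3, 42),  # shares <c1_3> with weather.lookup
    "email.send":       (9, 5),
}

train_pair = (
    "Check tomorrow's forecast and email it to Alice",
    format_target([tool_to_code["weather.forecast"], tool_to_code["email.send"]]),
)
print(train_pair[1])  # <c1_3><c2_42> <c1_9><c2_5>
```

Fine-tuning on pairs like this lets the LLM generate tool identifiers directly, unifying selection and execution as described in Step 3.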
System Components
- Hierarchical tool tokenizer: Converts each tool into a structured sequence of shared sub-codes by jointly encoding the tool's semantic description and its co-usage relationships with other tools in the library
- Shared codebook: A compact vocabulary of code tokens derived from clustering tool semantics and usage patterns, enabling logarithmic vocabulary growth and dense co-occurrence learning
- Generative alignment module: Fine-tunes the LLM to autoregressively generate hierarchical code sequences for the correct tools given a user query, replacing retrieval with direct generative selection
- Co-usage modeling: Captures extrinsic relationships between tools by modeling which tools frequently appear together in task solutions, informing the code assignment process
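One minimal way to collect the extrinsic co-usage signal described above is to count how often tool pairs appear together in solution traces. The trace format and helper below are hypothetical illustrations, not the paper's implementation; a real system would feed these counts, alongside semantic embeddings, into code assignment:

```python
from collections import Counter
from itertools import combinations

def co_usage_counts(traces: list[list[str]]) -> Counter:
    # Count each unordered tool pair once per solution trace.
    counts: Counter = Counter()
    for trace in traces:
        for pair in combinations(sorted(set(trace)), 2):
            counts[pair] += 1
    return counts

# Toy solution traces (tool names are made up).
traces = [
    ["weather.forecast", "email.send"],
    ["weather.forecast", "calendar.add", "email.send"],
    ["calendar.add", "email.send"],
]
counts = co_usage_counts(traces)
print(counts[("email.send", "weather.forecast")])  # 2
```

Tools with high co-usage counts would then be steered toward shared high-level codes, so their collaborative relationship is visible at the sub-token level rather than inferred from sparse monolithic-ID co-occurrences.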
Results
| Metric/Benchmark | Best Baseline | ToolWeaver | Delta |
|---|---|---|---|
| Tool selection accuracy (~47K tools) | Lower (prior SOTA) | Significantly higher | Substantial improvement |
| Vocabulary size scaling | Linear in # tools | Logarithmic in # tools | Orders of magnitude reduction |
| Generalization to unseen tools | Poor (semantically isolated tokens) | Improved (shared codes) | Qualitatively better |
| Collaborative tool relationship learning | Sparse (monolithic ID co-occurrence) | Dense (shared code co-occurrence) | Qualitatively richer |
Key Takeaways
- For practitioners building tool-augmented agents at scale, replacing per-tool unique tokens with hierarchical shared codes is a critical design choice that prevents vocabulary explosion and enables semantic generalization to new tools
- Co-usage patterns are as important as intrinsic tool semantics for effective tool selection; encoding both into the tool representation significantly improves an LLM's ability to orchestrate multi-tool workflows
- Generative tool selection (the LLM directly generates tool identifiers) is a more promising architecture than retrieval-based pipelines for large tool libraries, provided the representation scheme is scalable; ToolWeaver's logarithmic codebook makes this practical at ~47K-tool scale
Abstract
Prevalent retrieval-based tool-use pipelines struggle with a dual semantic challenge: their retrievers often employ encoders that fail to capture complex semantics, while the Large Language Model (LLM) itself lacks intrinsic tool knowledge from its natural language pretraining. Generative methods offer a powerful alternative by unifying selection and execution, tasking the LLM to directly learn and generate tool identifiers. However, the common practice of mapping each tool to a unique new token introduces substantial limitations: it creates a scalability and generalization crisis, as the vocabulary size explodes and each tool is assigned a semantically isolated token. This approach also creates a semantic bottleneck that hinders the learning of collaborative tool relationships, as the model must infer them from sparse co-occurrences of monolithic tool IDs within a vast library. To address these limitations, we propose ToolWeaver, a novel generative tool learning framework that encodes tools into hierarchical sequences. This approach makes vocabulary expansion logarithmic in the number of tools. Crucially, it enables the model to learn collaborative patterns from the dense co-occurrence of shared codes, rather than the sparse co-occurrence of monolithic tool IDs. We generate these structured codes through a novel tokenization process designed to weave together a tool's intrinsic semantics with its extrinsic co-usage patterns. These structured codes are then integrated into the LLM through a generative alignment stage, where the model is fine-tuned to produce the hierarchical code sequences. Evaluation results with nearly 47,000 tools show that ToolWeaver significantly outperforms state-of-the-art methods, establishing a more scalable, generalizable, and semantically aware foundation for advanced tool-augmented agents.