ToolWeaver: Weaving Collaborative Semantics for Scalable Tool Use in Large Language Models
Problem Statement
Retrieval-based pipelines fail to capture complex tool semantics, and LLMs lack intrinsic tool knowledge from their natural-language pretraining. Generative methods that assign each tool a unique new token suffer from vocabulary explosion (scaling linearly with tool count) and semantic isolation, making it impossible to learn collaborative tool relationships at scale. With libraries approaching tens of thousands of tools, these limitations create a fundamental scalability and generalization crisis.
Key Novelty
- Hierarchical code sequences for tool representation that make vocabulary expansion logarithmic rather than linear in the number of tools
- Novel tokenization process that weaves together intrinsic tool semantics with extrinsic co-usage patterns to generate structured, shared codes
- Generative alignment stage that fine-tunes the LLM to produce hierarchical code sequences, enabling dense co-occurrence learning of collaborative tool relationships instead of sparse monolithic ID co-occurrences
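As a rough illustration of the scaling claim, a shared codebook of b sub-tokens per level needs only ceil(log_b N) levels to give N tools distinct code sequences, so the added vocabulary grows logarithmically. The codebook size of 256 below is an assumed configuration for illustration, not a figure from the paper:

```python
import math

def unique_token_vocab(num_tools: int) -> int:
    # Baseline generative scheme: one new token per tool,
    # so added vocabulary grows linearly with tool count.
    return num_tools

def hierarchical_code_vocab(num_tools: int, codebook_size: int = 256) -> int:
    # Hierarchical scheme: with `codebook_size` shared sub-tokens per level,
    # ceil(log_b(num_tools)) levels suffice to address every tool, so added
    # vocabulary grows logarithmically with tool count.
    levels = math.ceil(math.log(num_tools, codebook_size))
    return levels * codebook_size

print(unique_token_vocab(47_000))       # 47000 new tokens
print(hierarchical_code_vocab(47_000))  # 2 levels x 256 = 512 new tokens
```

At the paper's ~47K-tool scale, two levels of 256 codes already address 256^2 = 65,536 tools, which is the "orders of magnitude reduction" claimed in the Results table.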
Evaluation Highlights
- Evaluated on a benchmark with nearly 47,000 tools, significantly outperforming state-of-the-art retrieval-based and generative tool-selection methods
- Demonstrates superior scalability and generalization compared to prior methods, particularly as tool library size grows large
Methodology
- Step 1 - Structured Code Generation: Apply a novel tokenization process to each tool that encodes both its intrinsic semantic properties (description, function) and extrinsic co-usage patterns (which tools are used together), producing a hierarchical code sequence per tool
- Step 2 - Vocabulary Construction: Build a compact shared codebook where tool codes are composed of reusable sub-tokens, ensuring vocabulary size grows logarithmically with tool count and enabling dense co-occurrence signals across tools sharing code components
- Step 3 - Generative Alignment Fine-tuning: Fine-tune the LLM on (task, hierarchical tool code sequence) pairs so the model learns to directly generate the structured code identifiers for the appropriate tools, unifying tool selection and execution within the LLM
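Step 3 can be sketched as constructing (task, code-sequence) training pairs, where tools sharing a high-level code reuse the same sub-token. The code-token format `<c1_i><c2_j>`, the toy tool library, and the two-level codes below are illustrative assumptions, not ToolWeaver's actual tokenizer output:

```python
# Hypothetical sketch of building fine-tuning targets for generative alignment.
# Each tool carries a two-level hierarchical code (level-1 cluster, level-2 index).

def format_target(tool_codes: list[tuple[int, int]]) -> str:
    # Render each tool's code as shared sub-tokens. Tools with the same
    # level-1 code emit the same <c1_*> token, which is what produces the
    # dense co-occurrence signal during fine-tuning.
    return " ".join(f"<c1_{a}><c2_{b}>" for a, b in tool_codes)

# Toy library: tool name -> hierarchical code (both codes are made up).
tool_to_code = {
    "weather.lookup":   (3, 17),
    "weather.forecast": (3, 42),  # shares <c1_3> with weather.lookup
    "email.send":       (9, 5),
}

train_pair = (
    "Check tomorrow's forecast and email it to Alice",
    format_target([tool_to_code["weather.forecast"], tool_to_code["email.send"]]),
)
print(train_pair[1])  # <c1_3><c2_42> <c1_9><c2_5>
```

Fine-tuning on pairs like this lets the LLM generate tool identifiers directly, unifying selection and execution as described in Step 3.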
System Components
- Hierarchical tool tokenizer: Converts each tool into a structured sequence of shared sub-codes by jointly encoding the tool's semantic description and its co-usage relationships with other tools in the library
- Shared codebook: A compact vocabulary of code tokens derived from clustering tool semantics and usage patterns, enabling logarithmic vocabulary growth and dense co-occurrence learning
- Generative alignment module: Fine-tunes the LLM to autoregressively generate hierarchical code sequences for the correct tools given a user query, replacing retrieval with direct generative selection
- Co-usage modeling: Captures extrinsic relationships between tools by modeling which tools frequently appear together in task solutions, informing the code assignment process
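One minimal way to collect the extrinsic co-usage signal described above is to count how often tool pairs appear together in solution traces. The trace format and helper below are hypothetical illustrations, not the paper's implementation; a real system would feed these counts, alongside semantic embeddings, into code assignment:

```python
from collections import Counter
from itertools import combinations

def co_usage_counts(traces: list[list[str]]) -> Counter:
    # Count each unordered tool pair once per solution trace.
    counts: Counter = Counter()
    for trace in traces:
        for pair in combinations(sorted(set(trace)), 2):
            counts[pair] += 1
    return counts

# Toy solution traces (tool names are made up).
traces = [
    ["weather.forecast", "email.send"],
    ["weather.forecast", "calendar.add", "email.send"],
    ["calendar.add", "email.send"],
]
counts = co_usage_counts(traces)
print(counts[("email.send", "weather.forecast")])  # 2
```

Tools with high co-usage counts would then be steered toward shared high-level codes, so their collaborative relationship is visible at the sub-token level rather than inferred from sparse monolithic-ID co-occurrences.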
Results
| Metric/Benchmark | Best Baseline | ToolWeaver | Delta |
|---|---|---|---|
| Tool selection accuracy (~47K tools) | Lower (prior SOTA) | Significantly higher | Substantial improvement |
| Vocabulary size scaling | Linear in # tools | Logarithmic in # tools | Orders of magnitude reduction |
| Generalization to unseen tools | Poor (semantically isolated tokens) | Improved (shared codes) | Qualitatively better |
| Collaborative tool relationship learning | Sparse (monolithic ID co-occurrence) | Dense (shared code co-occurrence) | Qualitatively richer |
Key Takeaways
- For practitioners building tool-augmented agents at scale, replacing per-tool unique tokens with hierarchical shared codes is a critical design choice that prevents vocabulary explosion and enables semantic generalization to new tools
- Co-usage patterns are as important as intrinsic tool semantics for effective tool selection; encoding both into the tool representation significantly improves an LLM's ability to orchestrate multi-tool workflows
- Generative tool selection (the LLM directly generates tool identifiers) is a more promising architecture than retrieval-based pipelines for large tool libraries, provided the representation scheme is scalable; ToolWeaver's logarithmic codebook makes this practical at ~47K-tool scale
Abstract
Prevalent retrieval-based tool-use pipelines struggle with a dual semantic challenge: their retrievers often employ encoders that fail to capture complex semantics, while the Large Language Model (LLM) itself lacks intrinsic tool knowledge from its natural language pretraining. Generative methods offer a powerful alternative by unifying selection and execution, tasking the LLM to directly learn and generate tool identifiers. However, the common practice of mapping each tool to a unique new token introduces substantial limitations: it creates a scalability and generalization crisis, as the vocabulary size explodes and each tool is assigned a semantically isolated token. This approach also creates a semantic bottleneck that hinders the learning of collaborative tool relationships, as the model must infer them from sparse co-occurrences of monolithic tool IDs within a vast library. To address these limitations, we propose ToolWeaver, a novel generative tool learning framework that encodes tools into hierarchical sequences. This approach makes vocabulary expansion logarithmic in the number of tools. Crucially, it enables the model to learn collaborative patterns from the dense co-occurrence of shared codes, rather than the sparse co-occurrence of monolithic tool IDs. We generate these structured codes through a novel tokenization process designed to weave together a tool's intrinsic semantics with its extrinsic co-usage patterns. These structured codes are then integrated into the LLM through a generative alignment stage, where the model is fine-tuned to produce the hierarchical code sequences. Evaluation results with nearly 47,000 tools show that ToolWeaver significantly outperforms state-of-the-art methods, establishing a more scalable, generalizable, and semantically aware foundation for advanced tool-augmented agents.