TUMIX: Multi-Agent Test-Time Scaling with Tool-Use Mixture

Yongchao Chen, Jiefeng Chen, Rui Meng, Jiye Yin, Na Li, Chuchu Fan, Chi Wang, Tomas Pfister, Jinsung Yoon
arXiv.org | 2025
TUMIX is an ensemble framework that runs multiple LLM agents in parallel, each using distinct tool-use strategies (text reasoning, code execution, search), and iteratively shares and refines answers to improve reasoning accuracy at test time. It achieves significant performance gains over state-of-the-art tool-augmented LLMs with near-equal or reduced inference costs.

Problem Statement

Despite powerful tools like Code Interpreter and Search being available to LLMs, there is no principled guidance on how to optimally combine textual reasoning, coding, and search for diverse question types. Single-agent systems are limited by committing to one tool-use strategy, missing complementary strengths of different approaches. Existing test-time scaling methods either ignore tool diversity or lack efficient confidence-based stopping, leading to unnecessary compute overhead.

Key Novelty

  • Tool-Use Mixture (TUMIX) ensemble framework that runs heterogeneous agents with distinct tool-use strategies in parallel and enables iterative cross-agent response sharing and refinement
  • LLM-driven auto-optimization of agent designs that maximizes agent diversity and quality without manual engineering
  • Confidence-based early stopping mechanism that halts iterative refinement when sufficient consensus/confidence is reached, reducing inference cost to ~49% while preserving performance

Evaluation Highlights

  • Average accuracy improvement of up to 3.55% over the best baseline (tool-augmented and test-time scaling methods) on Gemini-2.5-Pro and Gemini-2.5-Flash across key reasoning benchmarks
  • Confidence-based halting preserves full performance at only 49% of the standard inference cost, while further scaling beyond default iterations yields additional performance gains at higher cost

Breakthrough Assessment

6/10: TUMIX is a solid and practically relevant contribution that advances tool-augmented multi-agent reasoning with a well-motivated ensemble approach and efficient cost management, but it is primarily an engineering and orchestration advance rather than a fundamental algorithmic or architectural breakthrough.

Methodology

  1. Step 1 – Agent Pool Construction: Instantiate multiple LLM agents, each configured with a distinct tool-use strategy (e.g., pure text reasoning, code interpreter only, search only, hybrid combinations); optionally use LLM-based auto-optimization to design diverse, high-quality agent configurations
  2. Step 2 – Parallel Inference & Iterative Refinement: All agents process the same question in parallel, then iteratively share their current answers with each other; each agent refines its response conditioned on the question and the pool of previous answers from other agents
  3. Step 3 – Confidence-Based Aggregation & Early Stopping: Monitor inter-agent agreement or a confidence signal after each refinement round; halt early if sufficient confidence is reached to save compute, otherwise continue scaling iterations; aggregate final answers (e.g., majority vote or weighted ensemble) to produce the output (the full loop is sketched in the code below)
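
A minimal sketch of this three-step loop, assuming a hypothetical `call_agent(strategy, prompt)` helper that wraps whatever LLM and tool backend is available, and a stub `sufficient_confidence` check; the paper's actual prompts, tool integrations, and stopping signal are not reproduced here.

```python
from collections import Counter

# Hypothetical helper: sends a prompt to an LLM configured with the given
# tool-use strategy ("text", "code", "search", ...) and returns its answer.
def call_agent(strategy: str, prompt: str) -> str:
    raise NotImplementedError("plug in an LLM backend with the chosen tools")

def sufficient_confidence(answers: list[str]) -> bool:
    # Placeholder; replace with an agreement or confidence check such as the
    # one sketched under "Confidence-Based Early Stopping" below.
    return False

def run_round(question: str, strategies: list[str], prior=None) -> list[str]:
    """Step 2: every agent answers the question, seeing all prior answers."""
    answers = []
    for strategy in strategies:
        prompt = question
        if prior:
            prompt += "\n\nCandidate answers so far:\n" + "\n".join(f"- {a}" for a in prior)
            prompt += "\nRefine or correct your answer."
        answers.append(call_agent(strategy, prompt))
    return answers

def tumix(question: str, strategies: list[str], max_rounds: int = 3) -> str:
    answers = run_round(question, strategies)               # initial parallel pass
    for _ in range(max_rounds - 1):
        if sufficient_confidence(answers):                  # Step 3: early stopping
            break
        answers = run_round(question, strategies, answers)  # iterative refinement
    return Counter(answers).most_common(1)[0][0]            # majority-vote aggregation
```

Here `strategies` could be something like `["text", "code", "search", "code+search"]`, matching the agent pool described in Step 1; in a real deployment the round would be executed concurrently rather than in a sequential loop.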

System Components

Heterogeneous Agent Pool

A set of LLM agents each assigned a distinct tool-use strategy (text-only reasoning, code execution via Code Interpreter, web search, or combinations), ensuring diverse solution paths for the same question
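
For illustration, such a pool could be declared as a list of small configurations, one per tool-use strategy. The field names and strategy labels below are assumptions for the sketch, not the paper's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class AgentConfig:
    name: str
    tools: list[str] = field(default_factory=list)  # empty list = text-only reasoning
    system_prompt: str = "Answer the question step by step."

# One possible pool covering the strategies described above (illustrative only).
AGENT_POOL = [
    AgentConfig("text_only"),
    AgentConfig("coder", tools=["code_interpreter"],
                system_prompt="Write and run code to verify your answer."),
    AgentConfig("searcher", tools=["web_search"],
                system_prompt="Search the web for supporting evidence."),
    AgentConfig("code_and_search", tools=["code_interpreter", "web_search"]),
]
```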

Cross-Agent Iterative Refinement

A communication protocol where agents share their current answers after each round and condition subsequent responses on the question plus all prior agent outputs, enabling collaborative error correction
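
One way to realize this sharing step is to fold the other agents' latest answers into each agent's next prompt. The exact wording below is an illustrative guess, not the paper's prompt template.

```python
def build_refinement_prompt(question: str, own_answer: str, other_answers: list[str]) -> str:
    """Compose the round-k prompt from the question plus all round-(k-1) answers."""
    shared = "\n".join(f"Agent {i + 1}: {ans}" for i, ans in enumerate(other_answers))
    return (
        f"Question:\n{question}\n\n"
        f"Your previous answer:\n{own_answer}\n\n"
        f"Answers proposed by the other agents:\n{shared}\n\n"
        "Considering all of the above, give your refined final answer."
    )
```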

LLM-Based Agent Auto-Optimizer

A meta-level LLM module that automatically designs and selects agent configurations to maximize diversity and individual quality, reducing the need for manual prompt/strategy engineering
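
A rough sketch of how such a meta-level call might be phrased, assuming any `call_llm` callable that maps a prompt string to a completion string; the prompt wording and the `propose_agents` helper are illustrative assumptions, not the paper's optimizer.

```python
import json

def propose_agents(call_llm, existing_agents: list[dict], n_new: int = 2) -> list[dict]:
    """Ask a meta-level LLM for new agent configs that differ from the current pool."""
    prompt = (
        "Here are the current agent configurations (JSON):\n"
        f"{json.dumps(existing_agents, indent=2)}\n\n"
        f"Propose {n_new} new configurations that use different combinations of "
        "text reasoning, code execution, and web search, and that are likely to "
        "solve problems the current agents miss. Reply with a JSON list."
    )
    # Assumes the model returns valid JSON; production code would validate it.
    return json.loads(call_llm(prompt))
```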

Confidence-Based Early Stopping

A mechanism that monitors inter-agent agreement or an explicit confidence score and terminates the refinement loop early when consensus is sufficiently high, cutting inference cost to ~49% with minimal performance loss
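
A consensus check of this kind can be as simple as measuring the fraction of agents that agree on the current majority answer; the threshold and the string normalization below are assumptions, and the paper also allows an explicit confidence signal instead of raw agreement.

```python
from collections import Counter

def should_stop(answers: list[str], threshold: float = 0.8) -> bool:
    """Stop refining once enough agents agree on the same (normalized) answer."""
    normalized = [a.strip().lower() for a in answers]
    _, top_count = Counter(normalized).most_common(1)[0]
    return top_count / len(normalized) >= threshold
```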

Results

Benchmark/Setting | Best Baseline | TUMIX | Delta
Avg. accuracy (Gemini-2.5-Pro, reasoning benchmarks) | State-of-the-art tool-augmented / test-time scaling baseline | Up to +3.55% absolute | +3.55%
Avg. accuracy (Gemini-2.5-Flash, reasoning benchmarks) | State-of-the-art tool-augmented / test-time scaling baseline | Up to +3.55% absolute | +3.55%
Inference cost (confidence-based halting) | 100% (standard TUMIX) | ~49% of standard cost | -51% cost, same performance
Inference cost vs. performance (further scaling) | Default iteration budget | Higher accuracy | Tradeoff: more compute for more gains

Key Takeaways

  • Mixing tool-use strategies across parallel agents (text, code, search) is more effective than committing to a single strategy, making heterogeneous ensembles a practical default for tool-augmented LLM deployments
  • Agent diversity and quality are the primary drivers of ensemble performance; using LLMs to auto-optimize agent designs is a scalable alternative to manual prompt engineering when building multi-agent systems
  • Confidence-based early stopping is a practical mechanism to cut inference costs by ~50% without sacrificing accuracy, making multi-agent test-time scaling economically viable for production use cases

Abstract

While integrating tools like Code Interpreter and Search has significantly enhanced Large Language Model (LLM) reasoning in models like ChatGPT Agent and Gemini-Pro, practical guidance on optimal tool use is lacking. The core challenge is effectively combining textual reasoning, coding, and search for diverse questions. In this paper, we propose Tool-Use Mixture (TUMIX), an ensemble framework that runs multiple agents in parallel, each employing distinct tool-use strategies and answer paths. Agents in TUMIX iteratively share and refine responses based on the question and previous answers. In experiments, TUMIX achieves significant gains over state-of-the-art tool-augmented and test-time scaling methods, delivering an average accuracy improvement of up to 3.55% over the best baseline on Gemini-2.5-Pro and Gemini-2.5-Flash across key reasoning benchmarks, with near-equal inference costs. We find that agent diversity and quality are crucial and can be enhanced by using LLMs to auto-optimize agent designs. Furthermore, TUMIX can halt refinement upon reaching sufficient confidence, preserving performance at only 49% of the inference cost. Further scaling can achieve higher performance, albeit at a greater cost.