HALO: Hierarchical Autonomous Logic-Oriented Orchestration for Multi-Agent LLM Systems
Problem Statement
Existing multi-agent LLM systems rely on predefined agent roles and static communication topologies, which severely limits adaptability in complex or specialized task environments. This rigidity causes subpar performance on expert-level tasks such as advanced mathematics or nuanced moral reasoning. Additionally, most users lack prompt engineering expertise, creating a further gap between raw user intent and effective task execution.
Key Novelty
- Three-tier hierarchical agent architecture separating task decomposition (high-level), role instantiation (mid-level), and subtask execution (low-level) with dynamic agent creation per subtask
- Reformulation of subtask execution as a structured workflow search problem using Monte Carlo Tree Search (MCTS) to systematically explore agentic action spaces and identify optimal reasoning trajectories
- Adaptive Prompt Refinement module that automatically transforms raw, unstructured user queries into task-specific, high-quality prompts without requiring prompt engineering expertise
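A minimal sketch of what such a prompt refinement step might look like, assuming a generic chat-completion callable; the function name `refine_prompt` and the meta-prompt wording are illustrative assumptions, not HALO's actual implementation:

```python
# Illustrative sketch of an adaptive prompt refinement step.
# `llm_complete` is a placeholder for any chat-completion call; the meta-prompt
# wording below is an assumption, not taken from the HALO codebase.

REFINE_TEMPLATE = """You are a prompt engineer. Rewrite the raw user query below
into a clear, task-specific prompt. State the task type, the required output
format, and any constraints you can infer. Return only the rewritten prompt.

Raw query:
{query}"""


def refine_prompt(raw_query: str, llm_complete) -> str:
    """Turn an unstructured user query into a structured, task-specific prompt."""
    return llm_complete(REFINE_TEMPLATE.format(query=raw_query))
```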
Evaluation Highlights
- 14.4% average performance improvement over state-of-the-art multi-agent baselines across HumanEval (Code Generation), MMLU (General Reasoning), and MATH (Arithmetic Reasoning) benchmarks
- Up to 13.3% gain on MMLU Moral Scenarios and up to 19.6% gain on MATH Algebra subarea, demonstrating strong performance on specialized and expert-level tasks
Methodology
- A high-level planning agent receives the (optionally refined) user query and decomposes it into structured subtasks, establishing the overall execution plan and dependencies
- Mid-level role-design agents dynamically instantiate task-specific inference agents for each subtask, selecting appropriate personas, tools, and configurations rather than relying on predefined agent roles
- Low-level inference agents execute each subtask using MCTS to search over the agentic action space, constructing and evaluating reasoning trajectories to find optimal solutions before aggregating results
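The three-tier flow above can be summarized in a short sketch. All names here (`Subtask`, `plan_subtasks`, `design_agent`, `execute_with_search`, `aggregate`) are illustrative placeholders under the assumption of a planner, a role designer, and a search-based executor, not the actual HALO API:

```python
from dataclasses import dataclass, field

# Minimal sketch of the hierarchical pipeline described above.
# All component interfaces are assumptions for illustration only.

@dataclass
class Subtask:
    description: str
    dependencies: list[int] = field(default_factory=list)


def run_pipeline(query: str, planner, role_designer, searcher) -> str:
    # High level: decompose the (refined) query into ordered subtasks.
    subtasks: list[Subtask] = planner.plan_subtasks(query)

    results: dict[int, str] = {}
    for i, subtask in enumerate(subtasks):
        # Mid level: instantiate a task-specific inference agent
        # (persona, tools, configuration) for this subtask.
        agent = role_designer.design_agent(subtask)
        # Low level: execute the subtask via workflow search (e.g., MCTS),
        # conditioning on the outputs of its dependencies.
        context = [results[d] for d in subtask.dependencies]
        results[i] = searcher.execute_with_search(agent, subtask, context)

    # Aggregate subtask outputs into the final answer.
    return planner.aggregate(query, results)
```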
System Components
- High-Level Planning Agent: decomposes the overall user task into structured, manageable subtasks and defines the execution flow and inter-subtask dependencies
- Mid-Level Role-Design Agents: dynamically instantiate specialized inference agents tailored to each subtask's requirements, replacing static predefined agent roles with on-demand agent creation
- Low-Level Inference Agents: execute individual subtasks and leverage MCTS to systematically explore the reasoning/action space, selecting optimal trajectories for complex problem solving
- MCTS Workflow Search: treats subtask execution as a search problem, exploring branching reasoning paths, evaluating intermediate states, and backpropagating quality signals to find the best solution trajectory (see the sketch after this list)
- Adaptive Prompt Refinement: automatically transforms raw user queries into well-structured, task-specific prompts, lowering the barrier for non-expert users and improving downstream agent performance
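The sketch below shows a compact MCTS loop over reasoning states, mirroring the workflow-search framing above: selection by UCT, expansion with candidate actions, evaluation of intermediate states, and backpropagation of quality signals. The operators `propose_actions`, `apply_action`, and `evaluate_state` stand in for LLM-backed components and are assumptions, not HALO's actual interfaces:

```python
import math
import random

# Compact MCTS sketch over reasoning states. The three callables passed in
# are placeholders for LLM-backed operators (action proposal, state update,
# and self-evaluation); they are assumptions for illustration.

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children, self.visits, self.value = [], 0, 0.0


def uct(node, c=1.4):
    # Unvisited children are explored first.
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits
    )


def mcts(root_state, propose_actions, apply_action, evaluate_state, iterations=50):
    root = Node(root_state)
    for _ in range(iterations):
        # Selection: descend to a leaf following the UCT score.
        node = root
        while node.children:
            node = max(node.children, key=uct)
        # Expansion: generate candidate next actions for the current state.
        for action in propose_actions(node.state):
            node.children.append(Node(apply_action(node.state, action), node, action))
        leaf = random.choice(node.children) if node.children else node
        # Evaluation: score the intermediate reasoning state (e.g., LLM self-critique).
        reward = evaluate_state(leaf.state)
        # Backpropagation: push the quality signal up to the root.
        while leaf is not None:
            leaf.visits += 1
            leaf.value += reward
            leaf = leaf.parent
    # Choose the most-visited child's action as the next workflow step.
    best = max(root.children, key=lambda n: n.visits)
    return best.action
```

In practice the evaluation step would itself be an LLM call (or a verifier on intermediate outputs), and the returned action would be applied before re-running the search from the new state.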
Results
| Benchmark / Subtask | Best Baseline (SotA) | HALO | Delta |
|---|---|---|---|
| HumanEval (Code Generation) | SotA MAS baseline | Improved | Part of +14.4% avg |
| MMLU (General Reasoning) | SotA MAS baseline | Improved | Part of +14.4% avg |
| MMLU - Moral Scenarios | SotA MAS baseline | Best score | Up to +13.3% |
| MATH (Arithmetic Reasoning) | SotA MAS baseline | Improved | Part of +14.4% avg |
| MATH - Algebra | SotA MAS baseline | Best score | Up to +19.6% |
| Overall Average | SotA MAS baseline | Best average | +14.4% |
Key Takeaways
- Dynamic agent role instantiation at runtime—rather than fixing agent personas upfront—significantly improves adaptability; practitioners building MAS pipelines should consider separating role-design from task execution as distinct architectural layers
- Framing complex subtask execution as a search problem (via MCTS) is a powerful paradigm for improving LLM reasoning quality, especially on expert-level or multi-step tasks where greedy single-pass generation is insufficient
- An automatic prompt refinement stage is a practical, high-impact addition to any LLM-based system deployed to non-expert users, as query quality directly bottlenecks downstream agent performance
Abstract
Recent advancements in Multi-Agent Systems (MAS) powered by Large Language Models (LLMs) have demonstrated tremendous potential in diverse task scenarios. Nonetheless, existing agentic systems typically rely on predefined agent-role design spaces and static communication structures, limiting their adaptability as well as flexibility in complex interaction environments and leading to subpar performance on highly specialized and expert-level tasks. To address these issues, we introduce HALO, a multi-agent collaboration framework based on a hierarchical reasoning architecture. Specifically, we incorporate a high-level planning agent for task decomposition, mid-level role-design agents for subtask-specific agent instantiation, and low-level inference agents for subtask execution. Particularly, subtask execution is reformulated as a structured workflow search problem, where Monte Carlo Tree Search (MCTS) systematically explores the agentic action space to construct optimal reasoning trajectories. Additionally, as the majority of users lack expertise in prompt engineering, we leverage an Adaptive Prompt Refinement module to transform raw queries into task-specific prompts. Empirical evaluations on Code Generation (HumanEval), General Reasoning (MMLU), and Arithmetic Reasoning (MATH) benchmark datasets highlight the effectiveness of HALO, yielding a 14.4% average improvement over state-of-the-art baselines. Notably, HALO achieves up to 13.3% performance gain on the Moral Scenarios subject in the MMLU benchmark and up to 19.6% performance gain on the Algebra subarea in the MATH benchmark, indicating its advanced proficiency in tackling highly specialized and expert-level tasks. The code repository is available at https://github.com/23japhone/HALO.