
HALO: Hierarchical Autonomous Logic-Oriented Orchestration for Multi-Agent LLM Systems

Zhipeng Hou, Junyi Tang, Yipeng Wang
arXiv.org | 2025
HALO is a hierarchical multi-agent LLM framework that combines three-tier agent orchestration (planning, role-design, inference) with Monte Carlo Tree Search to dynamically construct optimal reasoning trajectories for complex, expert-level tasks.

Problem Statement

Existing multi-agent LLM systems rely on predefined agent roles and static communication topologies, which severely limits adaptability in complex or specialized task environments. This rigidity causes subpar performance on expert-level tasks such as advanced mathematics or nuanced moral reasoning. Additionally, most users lack prompt engineering expertise, creating a further gap between raw user intent and effective task execution.

Key Novelty

  • Three-tier hierarchical agent architecture separating task decomposition (high-level), role instantiation (mid-level), and subtask execution (low-level) with dynamic agent creation per subtask
  • Reformulation of subtask execution as a structured workflow search problem using Monte Carlo Tree Search (MCTS) to systematically explore agentic action spaces and identify optimal reasoning trajectories
  • Adaptive Prompt Refinement module that automatically transforms raw, unstructured user queries into task-specific, high-quality prompts without requiring prompt engineering expertise

Evaluation Highlights

  • 14.4% average performance improvement over state-of-the-art multi-agent baselines across HumanEval (Code Generation), MMLU (General Reasoning), and MATH (Arithmetic Reasoning) benchmarks
  • Up to 13.3% gain on MMLU Moral Scenarios and up to 19.6% gain on MATH Algebra subarea, demonstrating strong performance on specialized and expert-level tasks

Breakthrough Assessment

6/10. HALO presents a well-motivated and technically sound combination of hierarchical agent orchestration and MCTS-based reasoning search, yielding meaningful empirical gains. However, the core components (MCTS, hierarchical planning, prompt refinement) are individually established ideas, making this a solid engineering contribution and architectural synthesis rather than a fundamental methodological breakthrough.

Methodology

  1. A high-level planning agent receives the (optionally refined) user query and decomposes it into structured subtasks, establishing the overall execution plan and dependencies
  2. Mid-level role-design agents dynamically instantiate task-specific inference agents for each subtask, selecting appropriate personas, tools, and configurations rather than relying on predefined agent roles
  3. Low-level inference agents execute each subtask using MCTS to search over the agentic action space, constructing and evaluating reasoning trajectories to find optimal solutions before aggregating results
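The three-tier flow above can be sketched as a minimal pipeline. All names here (`plan`, `design_role`, `InferenceAgent`, and so on) are illustrative assumptions rather than HALO's actual API, and hard-coded stubs stand in for the LLM calls:

```python
from dataclasses import dataclass

@dataclass
class Subtask:
    description: str
    depends_on: list  # indices of prerequisite subtasks

@dataclass
class InferenceAgent:
    persona: str
    tools: list

def plan(query: str) -> list:
    """High level: decompose the query into ordered subtasks (stubbed)."""
    return [Subtask("analyze the problem", []),
            Subtask("derive a solution", [0]),
            Subtask("verify the answer", [1])]

def design_role(subtask: Subtask) -> InferenceAgent:
    """Mid level: instantiate a task-specific agent instead of a fixed role."""
    persona = "verifier" if "verify" in subtask.description else "solver"
    tools = ["calculator"] if persona == "solver" else []
    return InferenceAgent(persona, tools)

def execute(agent: InferenceAgent, subtask: Subtask, context: dict) -> str:
    """Low level: run the subtask (in HALO this step is an MCTS search)."""
    return f"{agent.persona} handled: {subtask.description}"

def run(query: str) -> list:
    results = {}
    for i, st in enumerate(plan(query)):
        agent = design_role(st)                       # dynamic agent creation
        ctx = {j: results[j] for j in st.depends_on}  # pass dependency outputs
        results[i] = execute(agent, st, ctx)
    return list(results.values())
```

The point of the sketch is the separation of concerns: the planner never executes, the role designer never plans, and each inference agent exists only for the lifetime of its subtask.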

System Components

High-Level Planning Agent

Decomposes the overall user task into structured, manageable subtasks and defines the execution flow and inter-subtask dependencies

Mid-Level Role-Design Agents

Dynamically instantiate specialized inference agents tailored to each subtask's requirements, replacing static predefined agent roles with on-demand agent creation
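As a rough sketch of this design choice (hypothetical names, not HALO's code), on-demand creation replaces a fixed role registry with a factory that derives persona, tools, and configuration from each subtask; simple keyword rules stand in for the role-design LLM call:

```python
# Static approach being replaced: a fixed registry of predefined roles.
STATIC_ROLES = {"math": "math tutor", "code": "python programmer"}

def make_agent(subtask: str) -> dict:
    """Dynamic approach: synthesize persona, tools, and config per subtask.

    In HALO a mid-level role-design agent (an LLM) produces these fields;
    here keyword heuristics stand in for that call.
    """
    wants_proof = "prove" in subtask.lower()
    return {
        "persona": "formal mathematician" if wants_proof else "generalist solver",
        "tools": ["symbolic_algebra"] if wants_proof else ["calculator"],
        "temperature": 0.2 if wants_proof else 0.7,
    }
```

The contrast with `STATIC_ROLES` is the key: the registry can only ever cover roles anticipated at design time, while the factory adapts to whatever subtasks the planner emits.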

Low-Level Inference Agents

Execute individual subtasks and leverage MCTS to systematically explore the reasoning/action space, selecting optimal trajectories for complex problem solving

Monte Carlo Tree Search (MCTS) Module

Treats subtask execution as a search problem, exploring branching reasoning paths, evaluating intermediate states, and backpropagating quality signals to find the best solution trajectory
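A minimal, self-contained version of this loop is sketched below on a toy action space; the three actions and the deterministic `reward` scorer are placeholder assumptions standing in for HALO's agentic actions and LLM-based trajectory evaluation:

```python
import math
import random

ACTIONS = ("a", "b", "c")  # stand-ins for agentic actions (tool call, reflect, ...)
MAX_DEPTH = 3              # trajectory length

def reward(trajectory):
    """Stand-in evaluator: in HALO an LLM scores trajectory quality."""
    return trajectory.count("b") / MAX_DEPTH  # pretend "b" steps are good

class Node:
    def __init__(self, trajectory=(), parent=None):
        self.trajectory = trajectory
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0  # sum of backpropagated rewards

    def expanded(self):
        return len(self.children) == len(ACTIONS)

def uct(node, c=1.4):
    """Upper Confidence bound for Trees: exploit mean value, explore rarity."""
    return (node.value / node.visits
            + c * math.sqrt(math.log(node.parent.visits) / node.visits))

def search(iterations=500, seed=0):
    random.seed(seed)
    root = Node()
    for _ in range(iterations):
        node = root
        # 1. Selection: descend via UCT while fully expanded and non-terminal.
        while node.expanded() and len(node.trajectory) < MAX_DEPTH:
            node = max(node.children, key=uct)
        # 2. Expansion: add one untried action.
        if len(node.trajectory) < MAX_DEPTH:
            tried = {ch.trajectory[-1] for ch in node.children}
            action = next(a for a in ACTIONS if a not in tried)
            node = Node(node.trajectory + (action,), parent=node)
            node.parent.children.append(node)
        # 3. Simulation: random rollout to a complete trajectory.
        traj = list(node.trajectory)
        while len(traj) < MAX_DEPTH:
            traj.append(random.choice(ACTIONS))
        r = reward(traj)
        # 4. Backpropagation: push the quality signal up to the root.
        while node is not None:
            node.visits += 1
            node.value += r
            node = node.parent
    # Extract the best trajectory by greedy descent on visit counts.
    node, best = root, ()
    while node.children:
        node = max(node.children, key=lambda ch: ch.visits)
        best = node.trajectory
    return best
```

With the toy reward above, the search concentrates visits on "b"-heavy trajectories, illustrating how backpropagated quality signals steer exploration toward the best reasoning path.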

Adaptive Prompt Refinement (APR) Module

Automatically transforms raw user queries into well-structured, task-specific prompts, lowering the barrier for non-expert users and improving downstream agent performance
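A rule-based stand-in for this module might look as follows; in HALO the refinement is LLM-driven, and the template fields here (`Task type`, `Objective`, `Output format`) are assumptions for illustration:

```python
def refine_prompt(raw_query: str) -> str:
    """Rule-based stand-in for Adaptive Prompt Refinement.

    HALO performs this step with an LLM; keyword heuristics here
    illustrate the raw-query -> structured, task-specific prompt shape.
    """
    q = raw_query.strip()
    lowered = q.lower()
    if any(w in lowered for w in ("solve", "equation", "algebra")):
        task, fmt = "Arithmetic Reasoning", "Show each step, then 'Answer: <value>'."
    elif any(w in lowered for w in ("function", "code", "implement")):
        task, fmt = "Code Generation", "Return a single runnable code block."
    else:
        task, fmt = "General Reasoning", "Explain briefly, then state a conclusion."
    return (f"Task type: {task}\n"
            f"Objective: {q}\n"
            f"Output format: {fmt}")
```

Even this crude version shows the value of the stage: downstream agents receive an explicit task type and output contract instead of an unstructured query.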

Results

Benchmark / Subtask          | Best Baseline     | HALO Result | Delta
HumanEval (Code Generation)  | SotA MAS baseline | Improved    | Part of +14.4% avg
MMLU (General Reasoning)     | SotA MAS baseline | Improved    | Part of +14.4% avg
MMLU - Moral Scenarios       | SotA MAS baseline | Best score  | Up to +13.3%
MATH (Arithmetic Reasoning)  | SotA MAS baseline | Improved    | Part of +14.4% avg
MATH - Algebra               | SotA MAS baseline | Best score  | Up to +19.6%
Overall Average              | SotA MAS baseline | Best score  | +14.4% avg

Key Takeaways

  • Dynamic agent role instantiation at runtime—rather than fixing agent personas upfront—significantly improves adaptability; practitioners building MAS pipelines should consider separating role-design from task execution as distinct architectural layers
  • Framing complex subtask execution as a search problem (via MCTS) is a powerful paradigm for improving LLM reasoning quality, especially on expert-level or multi-step tasks where greedy single-pass generation is insufficient
  • An automatic prompt refinement stage is a practical, high-impact addition to any LLM-based system deployed to non-expert users, as query quality directly bottlenecks downstream agent performance

Abstract

Recent advancements in Multi-Agent Systems (MAS) powered by Large Language Models (LLMs) have demonstrated tremendous potential in diverse task scenarios. Nonetheless, existing agentic systems typically rely on predefined agent-role design spaces and static communication structures, limiting their adaptability as well as flexibility in complex interaction environments and leading to subpar performance on highly specialized and expert-level tasks. To address these issues, we introduce HALO, a multi-agent collaboration framework based on a hierarchical reasoning architecture. Specifically, we incorporate a high-level planning agent for task decomposition, mid-level role-design agents for subtask-specific agent instantiation, and low-level inference agents for subtask execution. Particularly, subtask execution is reformulated as a structured workflow search problem, where Monte Carlo Tree Search (MCTS) systematically explores the agentic action space to construct optimal reasoning trajectories. Additionally, as the majority of users lack expertise in prompt engineering, we leverage an Adaptive Prompt Refinement module to transform raw queries into task-specific prompts. Empirical evaluations on Code Generation (HumanEval), General Reasoning (MMLU), and Arithmetic Reasoning (MATH) benchmark datasets highlight the effectiveness of HALO, yielding a 14.4% average improvement over state-of-the-art baselines. Notably, HALO achieves up to 13.3% performance gain on the Moral Scenarios subject in the MMLU benchmark and up to 19.6% performance gain on the Algebra subarea in the MATH benchmark, indicating its advanced proficiency in tackling highly specialized and expert-level tasks. The code repository is available at https://github.com/23japhone/HALO.

Generated on 2026-02-21 using Claude