HM-RAG: Hierarchical Multi-Agent Multimodal Retrieval Augmented Generation

HM-RAG introduces a three-tiered hierarchical multi-agent framework that coordinates specialized agents for query decomposition, modality-specific retrieval, and answer synthesis to handle complex multimodal queries across structured, unstructured, and graph-based data. The collaborative architecture targets the limitations of single-agent RAG systems on queries that require coordinated reasoning across heterogeneous data sources.

Problem Statement

Conventional single-agent RAG systems are inadequate for complex queries that require coordinated reasoning across diverse data types (structured databases, unstructured text, knowledge graphs, web sources). Existing systems lack mechanisms to decompose multi-part queries, retrieve from heterogeneous modalities in parallel, and reconcile potentially conflicting evidence from different sources. This creates critical gaps in multimodal reasoning tasks like scientific QA and crisis situation analysis where integrated knowledge synthesis is essential.

Key Novelty

Hierarchical three-tiered multi-agent architecture with role specialization (Decomposition, Multi-source Retrieval, and Decision agents) enabling coordinated intelligence across heterogeneous data ecosystems
Semantic-aware query rewriting combined with schema-guided context augmentation in the Decomposition Agent, enabling contextually coherent sub-task generation from complex queries
Consistency voting mechanism in the Decision Agent coupled with Expert Model Refinement to resolve discrepancies across multi-source retrieval results, enabling robust answer integration

Evaluation Highlights

12.95% improvement in answer accuracy over baseline RAG systems on ScienceQA and CrisisMMD benchmarks
3.56% boost in question classification accuracy over baseline, with state-of-the-art zero-shot results established on both ScienceQA and CrisisMMD datasets

Breakthrough Assessment

Score: 6/10. HM-RAG is a solid, practically motivated contribution that combines hierarchical multi-agent coordination with multimodal retrieval and achieves meaningful gains on established benchmarks. However, its core ideas (query decomposition, parallel retrieval, ensemble voting) are evolutionary extensions of existing RAG paradigms rather than a paradigm shift.

Methodology

Step 1 - Query Decomposition: The Decomposition Agent receives a complex multimodal query, applies semantic-aware query rewriting to clarify intent, and uses schema-guided context augmentation to break the query into contextually coherent sub-tasks tailored to different data modalities
Step 2 - Parallel Multi-source Retrieval: Specialized Multi-source Retrieval Agents execute simultaneous, modality-specific retrieval using plug-and-play modules targeting vector databases (unstructured text), graph databases (relational/knowledge graph data), and web-based sources, returning modality-aligned evidence sets
Step 3 - Decision and Answer Synthesis: The Decision Agent aggregates multi-source answers using consistency voting to identify high-confidence responses, applies Expert Model Refinement to resolve discrepancies between conflicting retrieval results, and produces a final integrated answer
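
The three steps above can be sketched as a small pipeline. This is a minimal illustration, not the authors' implementation: the function names, the string-based answer format, and the toy "split on and" decomposition are all assumptions made for the example, and the decision step is reduced to a plain majority count.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical interface: a retriever maps a sub-task to one answer string.
Retriever = Callable[[str], str]


def decompose(query: str) -> List[str]:
    """Step 1 placeholder: split a compound query on ' and '.

    The Decomposition Agent described above uses query rewriting and
    schema-guided augmentation; this split is only a toy substitute.
    """
    return [part.strip() for part in query.split(" and ") if part.strip()]


def retrieve_parallel(sub_task: str, retrievers: Dict[str, Retriever]) -> Dict[str, str]:
    """Step 2: run every modality-specific retriever concurrently."""
    with ThreadPoolExecutor(max_workers=max(1, len(retrievers))) as pool:
        futures = {name: pool.submit(fn, sub_task) for name, fn in retrievers.items()}
        return {name: future.result() for name, future in futures.items()}


def decide(answers: Dict[str, str]) -> str:
    """Step 3 placeholder: return the answer given by the most sources."""
    return Counter(answers.values()).most_common(1)[0][0]


def run_pipeline(query: str, retrievers: Dict[str, Retriever]) -> Dict[str, str]:
    """Chain the three steps and return one answer per sub-task."""
    return {task: decide(retrieve_parallel(task, retrievers)) for task in decompose(query)}
```

Calling run_pipeline with three stub retrievers (for example, lambdas keyed "vector", "graph", and "web") returns a dictionary from each sub-task to the answer most sources agreed on.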

System Components

Decomposition Agent

Top-tier agent that dissects complex queries into sub-tasks using semantic-aware query rewriting to capture intent and schema-guided context augmentation to align sub-tasks with the structure of available data sources

Multi-source Retrieval Agents

Middle-tier parallel agents using plug-and-play modality-specific modules to retrieve evidence from vector databases (dense text retrieval), graph databases (relational and knowledge graph traversal), and web-based sources simultaneously
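
One way to read "plug-and-play" is a shared retriever interface plus a registry, so that a new modality is added by registering one more module. The sketch below assumes such an interface; RetrieverModule, KeywordRetriever, and RetrieverRegistry are hypothetical names, and the keyword match stands in for real vector, graph, or web backends. It calls modules one after another for brevity, whereas the agents described above run in parallel.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


class RetrieverModule(Protocol):
    """Interface assumed for illustration; not defined in the source."""

    def retrieve(self, sub_task: str) -> List[str]: ...


class KeywordRetriever:
    """Toy in-memory module standing in for a vector, graph, or web backend."""

    def __init__(self, documents: List[str]) -> None:
        self.documents = documents

    def retrieve(self, sub_task: str) -> List[str]:
        # Return documents that share at least one lowercase token with the sub-task.
        terms = set(sub_task.lower().split())
        return [doc for doc in self.documents if terms & set(doc.lower().split())]


@dataclass
class RetrieverRegistry:
    modules: Dict[str, RetrieverModule] = field(default_factory=dict)

    def register(self, modality: str, module: RetrieverModule) -> None:
        """Plug in a new modality without touching existing modules."""
        self.modules[modality] = module

    def retrieve_all(self, sub_task: str) -> Dict[str, List[str]]:
        """Collect one evidence list per registered modality."""
        return {name: module.retrieve(sub_task) for name, module in self.modules.items()}
```

Because the registry depends only on the retrieve method, any object with that method can be registered, which is the property the plug-and-play claim relies on.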

Decision Agent

Bottom-tier synthesis agent that applies consistency voting across multi-source answers to select high-agreement responses and uses Expert Model Refinement to reconcile conflicting evidence and produce a final, coherent answer
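
A minimal sketch of consistency voting, assuming each source returns a short answer string and that agreement is measured by exact match after normalization. The source does not specify the agreement measure, so the exact-match rule, the consistency_vote name, and the min_agreement threshold are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Optional, Tuple


def consistency_vote(
    answers: Dict[str, str], min_agreement: float = 0.5
) -> Tuple[Optional[str], float]:
    """Return (winning answer, agreement ratio).

    The winning answer is None when no answer is shared by more than
    min_agreement of the sources (hypothetical threshold).
    """
    if not answers:
        return None, 0.0
    # Normalize so that "B" and "b " count as the same answer.
    normalized = [answer.strip().lower() for answer in answers.values()]
    winner, count = Counter(normalized).most_common(1)[0]
    agreement = count / len(normalized)
    return (winner if agreement > min_agreement else None), agreement
```

Returning the agreement ratio alongside the answer lets a caller decide when a low-agreement case should be passed to a further refinement step instead of being answered directly.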

Semantic-aware Query Rewriting

Sub-component of the Decomposition Agent that reformulates ambiguous or complex queries into clearer, intent-preserving forms suitable for downstream specialized retrieval

Schema-guided Context Augmentation

Sub-component of the Decomposition Agent that enriches sub-tasks with structural metadata from the schemas of available data sources, improving retrieval precision across heterogeneous sources
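
As a rough illustration of schema-guided augmentation, the sketch below attaches matching schema field names to a sub-task so that a downstream retriever knows which source and fields to target. The AugmentedSubTask type, the augment_with_schema function, and the simple word-overlap matching are hypothetical; the source does not describe how schema metadata is matched or represented.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AugmentedSubTask:
    text: str                # original sub-task wording
    source: str              # data source whose schema matched
    schema_hints: List[str]  # schema fields mentioned in the sub-task


def augment_with_schema(
    sub_task: str, schemas: Dict[str, List[str]]
) -> List[AugmentedSubTask]:
    """Attach matching schema fields from each data source to a sub-task.

    schemas maps a source name to its field names (assumed format).
    """
    words = set(sub_task.lower().replace("?", "").split())
    augmented = []
    for source, fields in schemas.items():
        hints = [name for name in fields if name.lower() in words]
        if hints:  # only keep sources whose schema is relevant
            augmented.append(AugmentedSubTask(sub_task, source, hints))
    return augmented
```

A sub-task that mentions no field of a given source produces no entry for that source, which is one simple way to keep retrieval focused on relevant stores.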

Expert Model Refinement

Post-voting refinement step within the Decision Agent that leverages specialized models to resolve cases where consistency voting yields low confidence or contradictory results
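
The fallback behaviour described here can be sketched as a vote followed by a conditional call to an expert model. Everything below is an assumption made for illustration: the ExpertModel callable signature, the 0.6 agreement threshold, and the prefer_graph_expert stub, which stands in for whatever specialized model the system actually uses.

```python
from collections import Counter
from typing import Callable, Dict

# Hypothetical signature: (question, per-source answers) -> final answer.
ExpertModel = Callable[[str, Dict[str, str]], str]


def prefer_graph_expert(question: str, answers: Dict[str, str]) -> str:
    """Stub expert that trusts the graph source; a real expert would be a model call."""
    return answers["graph"]


def decide_with_refinement(
    question: str,
    answers: Dict[str, str],
    expert: ExpertModel,
    min_agreement: float = 0.6,
) -> str:
    """Accept the majority answer if agreement is high enough, else ask the expert."""
    if not answers:
        raise ValueError("at least one source answer is required")
    top_answer, top_count = Counter(answers.values()).most_common(1)[0]
    if top_count / len(answers) >= min_agreement:
        return top_answer
    # Low agreement: hand the conflicting answers to the expert model.
    return expert(question, answers)
```

Keeping the expert behind a plain callable means the costlier model is invoked only for the low-agreement cases, and it can be swapped without changing the voting logic.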

Results

Metric/Benchmark	Baseline RAG	HM-RAG	Delta
Answer accuracy (ScienceQA, CrisisMMD)	reference (absolute value not given)	baseline + 12.95%	+12.95%
Question classification accuracy	reference (absolute value not given)	baseline + 3.56%	+3.56%
Zero-shot performance (ScienceQA)	not given	state of the art	not given
Zero-shot performance (CrisisMMD)	not given	state of the art	not given

Key Takeaways

Hierarchical agent specialization is a practical design pattern for RAG systems: separating query decomposition, retrieval, and synthesis into dedicated agents reduces task complexity per agent and improves overall system accuracy on multi-hop and multimodal queries
Plug-and-play modular retrieval design allows practitioners to incrementally extend HM-RAG with new data modalities (e.g., adding a new graph or vector source) without retraining or restructuring the entire pipeline, making it suitable for enterprise environments with evolving data ecosystems
Consistency voting with expert refinement is a robust strategy for handling retrieval conflicts in multimodal systems. Practitioners building RAG pipelines over heterogeneous sources should consider ensemble-style answer reconciliation rather than naive concatenation or single-source prioritization

Abstract

While Retrieval-Augmented Generation (RAG) augments Large Language Models (LLMs) with external knowledge, conventional single-agent RAG remains fundamentally limited in resolving complex queries demanding coordinated reasoning across heterogeneous data ecosystems. We present HM-RAG, a novel Hierarchical Multi-agent Multimodal RAG framework that pioneers collaborative intelligence for dynamic knowledge synthesis across structured, unstructured, and graph-based data. The framework is composed of a three-tiered architecture with specialized agents: a Decomposition Agent that dissects complex queries into contextually coherent sub-tasks via semantic-aware query rewriting and schema-guided context augmentation; Multi-source Retrieval Agents that carry out parallel, modality-specific retrieval using plug-and-play modules designed for vector, graph, and web-based databases; and a Decision Agent that uses consistency voting to integrate multi-source answers and resolve discrepancies in retrieval results through Expert Model Refinement. This architecture attains comprehensive query understanding by combining textual, graph-relational, and web-derived evidence, resulting in a remarkable 12.95% improvement in answer accuracy and a 3.56% boost in question classification accuracy over baseline RAG systems on the ScienceQA and CrisisMMD benchmarks. Notably, HM-RAG establishes state-of-the-art results in zero-shot settings on both datasets. Its modular architecture ensures seamless integration of new data modalities while maintaining strict data governance, marking a significant advancement in addressing the critical challenges of multimodal reasoning and knowledge synthesis in RAG systems.