HM-RAG: Hierarchical Multi-Agent Multimodal Retrieval Augmented Generation
Problem Statement
Conventional single-agent RAG systems are inadequate for complex queries that require coordinated reasoning across diverse data types (structured databases, unstructured text, knowledge graphs, web sources). Existing systems lack mechanisms to decompose multi-part queries, retrieve from heterogeneous modalities in parallel, and reconcile potentially conflicting evidence from different sources. This creates critical gaps in multimodal reasoning tasks like scientific QA and crisis situation analysis where integrated knowledge synthesis is essential.
Key Novelty
- Hierarchical three-tiered multi-agent architecture with role specialization (Decomposition, Multi-source Retrieval, and Decision agents) enabling coordinated intelligence across heterogeneous data ecosystems
- Semantic-aware query rewriting combined with schema-guided context augmentation in the Decomposition Agent, enabling contextually coherent sub-task generation from complex queries
- Consistency voting mechanism in the Decision Agent coupled with Expert Model Refinement to resolve discrepancies across multi-source retrieval results, enabling robust answer integration
Evaluation Highlights
- 12.95% improvement in answer accuracy over baseline RAG systems on ScienceQA and CrisisMMD benchmarks
- 3.56% boost in question classification accuracy over baseline, with state-of-the-art zero-shot results established on both ScienceQA and CrisisMMD datasets
Methodology
- Step 1 - Query Decomposition: The Decomposition Agent receives a complex multimodal query, applies semantic-aware query rewriting to clarify intent, and uses schema-guided context augmentation to break the query into contextually coherent sub-tasks tailored to different data modalities
- Step 2 - Parallel Multi-source Retrieval: Specialized Multi-source Retrieval Agents execute simultaneous, modality-specific retrieval using plug-and-play modules targeting vector databases (unstructured text), graph databases (relational/knowledge graph data), and web-based sources, returning modality-aligned evidence sets
- Step 3 - Decision and Answer Synthesis: The Decision Agent aggregates multi-source answers using consistency voting to identify high-confidence responses, applies Expert Model Refinement to resolve discrepancies between conflicting retrieval results, and produces a final integrated answer
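The three steps above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: `decompose`, `retrieve_parallel`, and `run_pipeline` are hypothetical names, the decomposition is a naive string split standing in for LLM-based rewriting, the retrievers are plain callables, and running every retriever once per sub-task is an assumption about how sub-tasks are routed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

# Hypothetical retriever signature: sub-task text in, candidate answer out.
Retriever = Callable[[str], str]


def decompose(query: str) -> List[str]:
    """Stand-in for the Decomposition Agent.

    Splits on the literal " and " only; the paper's agent instead uses
    semantic-aware rewriting and schema-guided context augmentation.
    """
    parts = [part.strip(" ?") for part in query.split(" and ")]
    return [part for part in parts if part]


def retrieve_parallel(sub_task: str, retrievers: Dict[str, Retriever]) -> Dict[str, str]:
    """Stand-in for the Multi-source Retrieval Agents: query all sources concurrently."""
    with ThreadPoolExecutor(max_workers=max(1, len(retrievers))) as pool:
        futures = {name: pool.submit(fn, sub_task) for name, fn in retrievers.items()}
        return {name: future.result() for name, future in futures.items()}


def run_pipeline(
    query: str,
    retrievers: Dict[str, Retriever],
    decide: Callable[[Dict[str, str]], str],
) -> List[Tuple[str, str]]:
    """Decompose, retrieve per sub-task, then let `decide` pick one answer each."""
    results = []
    for sub_task in decompose(query):
        evidence = retrieve_parallel(sub_task, retrievers)
        results.append((sub_task, decide(evidence)))
    return results


if __name__ == "__main__":
    # Toy retrievers (assumptions): each returns a fixed answer string.
    toy_retrievers = {
        "vector": lambda task: "A",
        "graph": lambda task: "A",
        "web": lambda task: "B",
    }
    # Toy decision rule (assumption): pick the alphabetically first answer.
    print(run_pipeline("what is x and what is y?", toy_retrievers,
                       lambda evidence: sorted(evidence.values())[0]))
```

Passing `decide` in as a parameter keeps the synthesis step swappable, which mirrors the separation of roles between the retrieval tier and the Decision Agent.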
System Components
- Decomposition Agent: Top-tier agent that dissects complex queries into sub-tasks, using semantic-aware query rewriting to capture intent and schema-guided context augmentation to align sub-tasks with the structure of available data sources
- Multi-source Retrieval Agents: Middle-tier parallel agents that use plug-and-play, modality-specific modules to retrieve evidence simultaneously from vector databases (dense text retrieval), graph databases (relational and knowledge graph traversal), and web-based sources
- Decision Agent: Bottom-tier synthesis agent that applies consistency voting across multi-source answers to select high-agreement responses and uses Expert Model Refinement to reconcile conflicting evidence into a final, coherent answer
- Semantic-aware Query Rewriting: Sub-component of the Decomposition Agent that reformulates ambiguous or complex queries into clearer, intent-preserving forms suitable for downstream specialized retrieval
- Schema-guided Context Augmentation: Sub-component of the Decomposition Agent that enriches sub-tasks with structural metadata from available data schemas, improving retrieval precision across heterogeneous data sources
- Expert Model Refinement: Post-voting refinement step within the Decision Agent that uses specialized models to resolve cases where consistency voting yields low confidence or contradictory results
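The vote-then-refine behavior described for the Decision Agent can be illustrated with a short sketch. Everything here is an assumption made for illustration: the function name `decide`, the `min_agreement` threshold, exact-match voting on normalized strings (the paper may measure agreement differently), and the `refine` callback standing in for Expert Model Refinement.

```python
from collections import Counter
from typing import Callable, Dict


def decide(
    answers: Dict[str, str],
    refine: Callable[[Dict[str, str]], str],
    min_agreement: float = 0.5,
) -> str:
    """Return the majority answer, or defer to `refine` when agreement is low.

    answers:       source name -> candidate answer (hypothetical structure).
    refine:        stand-in for Expert Model Refinement; receives all answers.
    min_agreement: assumed threshold; the winning answer's share of votes
                   must be strictly greater than this value.
    """
    if not answers:
        raise ValueError("no answers to vote on")

    # Normalize so trivially different spellings count as the same vote.
    normalized = [answer.strip().lower() for answer in answers.values()]
    top_answer, votes = Counter(normalized).most_common(1)[0]

    if votes / len(normalized) > min_agreement:
        return top_answer
    # Low agreement: hand the conflicting evidence to the refinement step.
    return refine(answers)


if __name__ == "__main__":
    def fallback(conflicting: Dict[str, str]) -> str:
        return "needs expert review"

    # Two of three sources agree, so voting alone decides.
    print(decide({"vector": "B", "graph": "b ", "web": "C"}, fallback))
    # All three disagree, so the fallback is consulted.
    print(decide({"vector": "A", "graph": "B", "web": "C"}, fallback))
```

Keeping the refinement step behind a callback means the more expensive expert model is only invoked when the sources actually disagree.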
Results
| Metric | Benchmarks | HM-RAG vs. baseline RAG |
|---|---|---|
| Answer accuracy | ScienceQA, CrisisMMD | +12.95% |
| Question classification accuracy | ScienceQA, CrisisMMD | +3.56% |
| Zero-shot performance | ScienceQA | State of the art |
| Zero-shot performance | CrisisMMD | State of the art |
Key Takeaways
- Hierarchical agent specialization is a practical design pattern for RAG systems: separating query decomposition, retrieval, and synthesis into dedicated agents reduces task complexity per agent and improves overall system accuracy on multi-hop and multimodal queries
- Plug-and-play modular retrieval design allows practitioners to incrementally extend HM-RAG with new data modalities (e.g., adding a new graph or vector source) without retraining or restructuring the entire pipeline, making it suitable for enterprise environments with evolving data ecosystems
- Consistency voting with expert refinement is a robust strategy for handling retrieval conflicts in multimodal systems; practitioners building RAG pipelines over heterogeneous sources should consider ensemble-style answer reconciliation rather than naive concatenation or single-source prioritization
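The plug-and-play takeaway can be made concrete with a small registry sketch. This is one possible design, not the framework's actual interface: `RetrieverRegistry`, `register`, and `retrieve_all` are hypothetical names, and each module is assumed to be a callable from a query string to a list of evidence strings.

```python
from typing import Callable, Dict, List

# Hypothetical module signature: query text in, list of evidence snippets out.
RetrievalModule = Callable[[str], List[str]]


class RetrieverRegistry:
    """Holds modality-specific retrieval modules behind one shared interface."""

    def __init__(self) -> None:
        self._modules: Dict[str, RetrievalModule] = {}

    def register(self, modality: str, module: RetrievalModule) -> None:
        """Add a new modality without touching the existing ones."""
        if modality in self._modules:
            raise ValueError(f"modality already registered: {modality}")
        self._modules[modality] = module

    def modalities(self) -> List[str]:
        """Return registered modality names in registration order."""
        return list(self._modules)

    def retrieve_all(self, query: str) -> Dict[str, List[str]]:
        """Run every registered module on the query and group results by modality."""
        return {name: module(query) for name, module in self._modules.items()}


if __name__ == "__main__":
    registry = RetrieverRegistry()
    # Toy modules (assumptions) standing in for vector and graph back ends.
    registry.register("vector", lambda q: [f"text passage about {q}"])
    registry.register("graph", lambda q: [f"graph triple about {q}"])

    # Extending the system later is a single call; callers are unchanged.
    registry.register("web", lambda q: [f"web snippet about {q}"])
    print(registry.retrieve_all("photosynthesis"))
```

Because callers only depend on `retrieve_all`, adding a source does not require changes to the decomposition or decision stages in this sketch.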
Abstract
While Retrieval-Augmented Generation (RAG) augments Large Language Models (LLMs) with external knowledge, conventional single-agent RAG remains fundamentally limited in resolving complex queries demanding coordinated reasoning across heterogeneous data ecosystems. We present HM-RAG, a novel Hierarchical Multi-agent Multimodal RAG framework that pioneers collaborative intelligence for dynamic knowledge synthesis across structured, unstructured, and graph-based data. The framework is composed of a three-tiered architecture with specialized agents: a Decomposition Agent that dissects complex queries into contextually coherent sub-tasks via semantic-aware query rewriting and schema-guided context augmentation; Multi-source Retrieval Agents that carry out parallel, modality-specific retrieval using plug-and-play modules designed for vector, graph, and web-based databases; and a Decision Agent that uses consistency voting to integrate multi-source answers and resolve discrepancies in retrieval results through Expert Model Refinement. This architecture attains comprehensive query understanding by combining textual, graph-relational, and web-derived evidence, resulting in a remarkable 12.95% improvement in answer accuracy and a 3.56% boost in question classification accuracy over baseline RAG systems on the ScienceQA and CrisisMMD benchmarks. Notably, HM-RAG establishes state-of-the-art results in zero-shot settings on both datasets. Its modular architecture ensures seamless integration of new data modalities while maintaining strict data governance, marking a significant advancement in addressing the critical challenges of multimodal reasoning and knowledge synthesis in RAG systems.