MA-RAG: Multi-Agent Retrieval-Augmented Generation via Collaborative Chain-of-Thought Reasoning

Thang Nguyen, Peter Chin, Yu-Wing Tai
arXiv.org | 2025
MA-RAG introduces a multi-agent framework for Retrieval-Augmented Generation where specialized agents (Planner, Step Definer, Extractor, QA) collaboratively decompose and solve complex information-seeking tasks via chain-of-thought reasoning. This modular orchestration significantly outperforms both standalone LLMs and existing RAG pipelines across multi-hop and ambiguous QA benchmarks.

Problem Statement

Conventional RAG systems struggle with complex, multi-hop, and ambiguous queries because they rely on monolithic end-to-end pipelines or isolated component improvements that lack structured reasoning and inter-component communication. Query ambiguities propagate through the pipeline unchecked, degrading both retrieval quality and answer synthesis, and such pipelines offer little interpretability into how retrieved evidence shapes the final answer.

Key Novelty

  • Multi-agent decomposition of the RAG pipeline into four specialized roles (Planner, Step Definer, Extractor, QA Agent), each handling a distinct reasoning stage, enabling modular interpretability
  • Collaborative chain-of-thought prompting between agents that lets intermediate reasoning be communicated and refined across pipeline stages rather than processed in isolation (see the message sketch after this list)
  • Zero-shot generalization to specialized domains (e.g., medical QA) achieving competitive performance against domain-specific fine-tuned models without any domain adaptation
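
To make the inter-agent chain-of-thought concrete, one can picture agents exchanging structured messages rather than bare strings. The schema below is a minimal illustrative sketch; the type and field names (`AgentMessage`, `step`, `rationale`, `evidence`) are assumptions, not the paper's actual message format.

```python
from dataclasses import dataclass, field

@dataclass
class AgentMessage:
    """One hop of inter-agent chain-of-thought (illustrative, assumed schema)."""
    step: str       # the sub-task this message addresses
    rationale: str  # the emitting agent's intermediate reasoning
    evidence: list[str] = field(default_factory=list)  # supporting passages
```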

Evaluation Highlights

  • LLaMA3-8B with MA-RAG surpasses larger standalone LLMs on the multi-hop HotpotQA and 2WikimQA benchmarks, delivering large-model accuracy at small-model inference cost
  • The LLaMA3-70B and GPT-4o-mini variants of MA-RAG set new state-of-the-art results on the challenging multi-hop HotpotQA and 2WikimQA datasets, and outperform all existing RAG baselines on NQ and TriviaQA as well

Breakthrough Assessment

6/10. MA-RAG is a solid, well-executed contribution that meaningfully advances RAG systems through principled multi-agent collaboration and chain-of-thought communication, backed by strong empirical results. However, agent specialization and chain-of-thought prompting are not individually novel ideas, so the work is best read as an important engineering and systems advance rather than a fundamental paradigm shift.

Methodology

  1. The Planner agent receives an ambiguous or complex query and decomposes it into a structured plan of sub-tasks, disambiguating intent and identifying the reasoning steps required
  2. The Step Definer and Extractor agents operationalize each sub-task by formulating targeted retrieval queries and extracting relevant evidence from retrieved documents, passing intermediate chain-of-thought reasoning to subsequent agents
  3. The QA Agent synthesizes the collected evidence and intermediate reasoning traces into a final coherent answer, leveraging the full chain-of-thought context accumulated across all prior agents (the sketch below traces this loop end to end)
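
The following is a minimal end-to-end sketch of the loop described above. The prompt wording, the `call_llm` and `retrieve` stubs, and the plain-string note format are illustrative assumptions, not the paper's actual prompts or interfaces.

```python
def call_llm(prompt: str) -> str:
    """Stand-in for any text-in/text-out LLM client (assumed, not specified)."""
    raise NotImplementedError("plug in an LLM client")

def retrieve(query: str, k: int = 5) -> list[str]:
    """Stand-in for the retriever backing the Extractor (assumed interface)."""
    raise NotImplementedError("plug in a retriever")

def ma_rag(question: str) -> str:
    # Planner: decompose the (possibly ambiguous) question into ordered steps.
    plan = call_llm(
        "Decompose this question into numbered sub-tasks, one per line, "
        f"resolving any ambiguity first:\n{question}"
    ).splitlines()

    notes: list[str] = []  # chain-of-thought accumulated across agents
    for step in plan:
        # Step Definer: turn the abstract step into a concrete search query,
        # conditioned on everything reasoned so far.
        sub_query = call_llm(
            f"Question: {question}\nNotes so far: {'; '.join(notes)}\n"
            f"Write one search query for this sub-task: {step}"
        )
        # Extractor: keep only the evidence relevant to this step, with a
        # short rationale that downstream agents can read.
        passages = "\n".join(retrieve(sub_query))
        evidence = call_llm(
            f"Sub-task: {step}\nPassages:\n{passages}\n"
            "Quote the relevant facts and state in one sentence what they imply."
        )
        notes.append(f"{step}: {evidence}")

    # QA Agent: synthesize the final answer from the accumulated trace.
    return call_llm(
        f"Question: {question}\nReasoning trace:\n" + "\n".join(notes)
        + "\nGive a short final answer."
    )
```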

System Components

Planner Agent

Decomposes the input query into a structured multi-step reasoning plan, handling query disambiguation and identifying sub-goals critical for multi-hop reasoning
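
A planner invocation might look like the sketch below, assuming a generic text-in/text-out `llm` callable; the prompt wording is illustrative, not the paper's actual prompt.

```python
from typing import Callable

PLANNER_PROMPT = """\
You are a planning agent. Decompose the question into the minimal ordered
list of sub-tasks needed to answer it, resolving any ambiguity explicitly.
Output one sub-task per line, numbered.

Question: {question}"""

def plan(question: str, llm: Callable[[str], str]) -> list[str]:
    # Splitting on newlines assumes the model follows the one-per-line format.
    return llm(PLANNER_PROMPT.format(question=question)).strip().splitlines()
```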

Step Definer Agent

Translates the planner's high-level steps into concrete, actionable retrieval sub-queries and defines the scope of evidence needed for each reasoning step
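
Under the same assumptions, a step-definition call could take the question, the notes accumulated so far, and the current sub-task, and return a single focused search query; again the prompt is an illustrative guess.

```python
from typing import Callable

STEP_DEFINER_PROMPT = """\
You are a step-definition agent. Given the question, the notes gathered so
far, and the current sub-task, write one focused search query that would
retrieve exactly the evidence this step needs.

Question: {question}
Notes so far: {notes}
Sub-task: {step}

Search query:"""

def define_step(question: str, notes: str, step: str,
                llm: Callable[[str], str]) -> str:
    return llm(STEP_DEFINER_PROMPT.format(
        question=question, notes=notes, step=step)).strip()
```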

Extractor Agent

Executes retrieval and extracts relevant evidence from retrieved documents for each defined step, producing intermediate chain-of-thought reasoning outputs
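
The extraction stage can be sketched as a prompt over retrieved passages; the `---` separator and prompt text below are arbitrary illustrative choices, not the paper's.

```python
from typing import Callable

EXTRACTOR_PROMPT = """\
You are an extraction agent. From the retrieved passages, quote only the
facts relevant to the sub-task, then state in one sentence what they imply.

Sub-task: {step}
Passages:
{passages}"""

def extract(step: str, passages: list[str],
            llm: Callable[[str], str]) -> str:
    # Passages come from whatever retriever backs the system.
    return llm(EXTRACTOR_PROMPT.format(
        step=step, passages="\n---\n".join(passages)))
```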

QA Agent

Receives the accumulated evidence and reasoning traces from all prior agents and synthesizes a final, grounded answer; performance here is highly sensitive to model capacity
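
A sketch of the QA stage under the same assumptions follows; restricting the agent to the reasoning trace mirrors the grounding described above, though the exact prompt is again illustrative.

```python
from typing import Callable

QA_PROMPT = """\
You are an answering agent. Using only the reasoning trace below, give a
short, direct answer. If the trace is insufficient, say what is missing.

Question: {question}
Reasoning trace:
{trace}

Final answer:"""

def answer(question: str, trace: list[str],
           llm: Callable[[str], str]) -> str:
    return llm(QA_PROMPT.format(question=question, trace="\n".join(trace)))
```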

Results

Benchmark | Best Existing RAG Baseline | MA-RAG (LLaMA3-70B / GPT-4o-mini) | Delta
HotpotQA (multi-hop) | Prior SOTA RAG | New SOTA | Significant improvement
2WikimQA (multi-hop) | Prior SOTA RAG | New SOTA | Significant improvement
NQ (open-domain) | Standalone LLM / RAG baselines | Outperforms all baselines | Positive across model scales
TriviaQA (open-domain) | Standalone LLM / RAG baselines | Outperforms all baselines | Positive across model scales
Medical QA (domain-specific) | Domain-fine-tuned models | Competitive without fine-tuning | No domain adaptation needed

Key Takeaways

  • Practitioners can achieve large-model-level QA performance with smaller models (e.g., LLaMA3-8B) by replacing monolithic RAG pipelines with a multi-agent architecture, reducing inference costs for production deployment
  • The ablation finding that the Planner and Extractor agents are the most critical components suggests that investment in query decomposition and precise evidence extraction yields the highest ROI when improving RAG systems
  • MA-RAG's zero-shot domain generalization (e.g., medical QA) demonstrates that well-structured agent orchestration with CoT can substitute for expensive domain-specific fine-tuning, making it highly practical for low-resource or rapidly evolving domains

Abstract

We present MA-RAG, a Multi-Agent framework for Retrieval-Augmented Generation (RAG) that addresses the inherent ambiguities and reasoning challenges in complex information-seeking tasks. Unlike conventional RAG methods that rely on end-to-end fine-tuning or isolated component enhancements, MA-RAG orchestrates a collaborative set of specialized AI agents: Planner, Step Definer, Extractor, and QA Agents, each responsible for a distinct stage of the RAG pipeline. By decomposing tasks into subtasks such as query disambiguation, evidence extraction, and answer synthesis, and enabling agents to communicate intermediate reasoning via chain-of-thought prompting, MA-RAG progressively refines retrieval and synthesis while maintaining modular interpretability. Extensive experiments on multi-hop and ambiguous QA benchmarks, including NQ, HotpotQA, 2WikimQA, and TriviaQA, demonstrate that MA-RAG significantly outperforms standalone LLMs and existing RAG methods across all model scales. Notably, even a small LLaMA3-8B model equipped with MA-RAG surpasses larger standalone LLMs, while larger variants (LLaMA3-70B and GPT-4o-mini) set new state-of-the-art results on challenging multi-hop datasets. Ablation studies reveal that both the planner and extractor agents are critical for multi-hop reasoning, and that high-capacity models are especially important for the QA agent to synthesize answers effectively. Beyond general-domain QA, MA-RAG generalizes to specialized domains such as medical QA, achieving competitive performance against domain-specific models without any domain-specific fine-tuning. Our results highlight the effectiveness of collaborative, modular reasoning in retrieval-augmented systems: MA-RAG not only improves answer accuracy and robustness but also provides interpretable intermediate reasoning steps, establishing a new paradigm for efficient and reliable multi-agent RAG.
