MA-RAG: Multi-Agent Retrieval-Augmented Generation via Collaborative Chain-of-Thought Reasoning

Thang Nguyen, Peter Chin, Yu-Wing Tai
arXiv.org | 2025
MA-RAG introduces a multi-agent framework for Retrieval-Augmented Generation where specialized agents (Planner, Step Definer, Extractor, QA) collaboratively decompose and solve complex information-seeking tasks via chain-of-thought reasoning. This modular orchestration significantly outperforms both standalone LLMs and existing RAG pipelines across multi-hop and ambiguous QA benchmarks.

Problem Statement

Conventional RAG systems struggle with complex, multi-hop, and ambiguous queries because they rely on monolithic end-to-end pipelines or isolated component improvements that lack structured reasoning and inter-component communication. Query ambiguities propagate through the pipeline unchecked, degrading both retrieval quality and answer synthesis, and such pipelines offer little interpretability into how retrieved evidence shapes the final answer.

Key Novelty

  • Multi-agent decomposition of the RAG pipeline into four specialized roles (Planner, Step Definer, Extractor, QA Agent), each handling a distinct reasoning stage, enabling modular interpretability
  • Collaborative chain-of-thought prompting between agents that lets intermediate reasoning be communicated and refined across pipeline stages rather than processed in isolation (see the message sketch after this list)
  • Zero-shot generalization to specialized domains (e.g., medical QA) achieving competitive performance against domain-specific fine-tuned models without any domain adaptation
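
To make the inter-agent chain-of-thought concrete, one can picture agents exchanging structured messages rather than bare strings. The schema below is a minimal illustrative sketch; the type and field names (`AgentMessage`, `step`, `rationale`, `evidence`) are assumptions, not the paper's actual message format.

```python
from dataclasses import dataclass, field

@dataclass
class AgentMessage:
    """One hop of inter-agent chain-of-thought (illustrative, assumed schema)."""
    step: str       # the sub-task this message addresses
    rationale: str  # the emitting agent's intermediate reasoning
    evidence: list[str] = field(default_factory=list)  # supporting passages
```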

Evaluation Highlights

  • LLaMA3-8B with MA-RAG surpasses larger standalone LLMs on the multi-hop HotpotQA and 2WikimQA benchmarks, delivering large-model accuracy at small-model inference cost
  • The LLaMA3-70B and GPT-4o-mini variants of MA-RAG set new state-of-the-art results on the challenging multi-hop HotpotQA and 2WikimQA datasets, and outperform all existing RAG baselines on NQ and TriviaQA as well

Breakthrough Assessment

6/10. MA-RAG is a solid, well-executed contribution that meaningfully advances RAG systems through principled multi-agent collaboration and chain-of-thought communication, backed by strong empirical results. However, agent specialization and chain-of-thought prompting are not individually novel ideas, so the work is best read as an important engineering and systems advance rather than a fundamental paradigm shift.

Methodology

  1. The Planner agent receives an ambiguous or complex query and decomposes it into a structured plan of sub-tasks, disambiguating intent and identifying the reasoning steps required
  2. The Step Definer and Extractor agents operationalize each sub-task by formulating targeted retrieval queries and extracting relevant evidence from retrieved documents, passing intermediate chain-of-thought reasoning to subsequent agents
  3. The QA Agent synthesizes the collected evidence and intermediate reasoning traces into a final coherent answer, leveraging the full chain-of-thought context accumulated across all prior agents (the sketch below traces this loop end to end)
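
The following is a minimal end-to-end sketch of the loop described above. The prompt wording, the `call_llm` and `retrieve` stubs, and the plain-string note format are illustrative assumptions, not the paper's actual prompts or interfaces.

```python
def call_llm(prompt: str) -> str:
    """Stand-in for any text-in/text-out LLM client (assumed, not specified)."""
    raise NotImplementedError("plug in an LLM client")

def retrieve(query: str, k: int = 5) -> list[str]:
    """Stand-in for the retriever backing the Extractor (assumed interface)."""
    raise NotImplementedError("plug in a retriever")

def ma_rag(question: str) -> str:
    # Planner: decompose the (possibly ambiguous) question into ordered steps.
    plan = call_llm(
        "Decompose this question into numbered sub-tasks, one per line, "
        f"resolving any ambiguity first:\n{question}"
    ).splitlines()

    notes: list[str] = []  # chain-of-thought accumulated across agents
    for step in plan:
        # Step Definer: turn the abstract step into a concrete search query,
        # conditioned on everything reasoned so far.
        sub_query = call_llm(
            f"Question: {question}\nNotes so far: {'; '.join(notes)}\n"
            f"Write one search query for this sub-task: {step}"
        )
        # Extractor: keep only the evidence relevant to this step, with a
        # short rationale that downstream agents can read.
        passages = "\n".join(retrieve(sub_query))
        evidence = call_llm(
            f"Sub-task: {step}\nPassages:\n{passages}\n"
            "Quote the relevant facts and state in one sentence what they imply."
        )
        notes.append(f"{step}: {evidence}")

    # QA Agent: synthesize the final answer from the accumulated trace.
    return call_llm(
        f"Question: {question}\nReasoning trace:\n" + "\n".join(notes)
        + "\nGive a short final answer."
    )
```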

System Components

Planner Agent

Decomposes the input query into a structured multi-step reasoning plan, handling query disambiguation and identifying sub-goals critical for multi-hop reasoning
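
A planner invocation might look like the sketch below, assuming a generic text-in/text-out `llm` callable; the prompt wording is illustrative, not the paper's actual prompt.

```python
from typing import Callable

PLANNER_PROMPT = """\
You are a planning agent. Decompose the question into the minimal ordered
list of sub-tasks needed to answer it, resolving any ambiguity explicitly.
Output one sub-task per line, numbered.

Question: {question}"""

def plan(question: str, llm: Callable[[str], str]) -> list[str]:
    # Splitting on newlines assumes the model follows the one-per-line format.
    return llm(PLANNER_PROMPT.format(question=question)).strip().splitlines()
```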

Step Definer Agent

Translates the planner's high-level steps into concrete, actionable retrieval sub-queries and defines the scope of evidence needed for each reasoning step
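
Under the same assumptions, a step-definition call could take the question, the notes accumulated so far, and the current sub-task, and return a single focused search query; again the prompt is an illustrative guess.

```python
from typing import Callable

STEP_DEFINER_PROMPT = """\
You are a step-definition agent. Given the question, the notes gathered so
far, and the current sub-task, write one focused search query that would
retrieve exactly the evidence this step needs.

Question: {question}
Notes so far: {notes}
Sub-task: {step}

Search query:"""

def define_step(question: str, notes: str, step: str,
                llm: Callable[[str], str]) -> str:
    return llm(STEP_DEFINER_PROMPT.format(
        question=question, notes=notes, step=step)).strip()
```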

Extractor Agent

Executes retrieval and extracts relevant evidence from retrieved documents for each defined step, producing intermediate chain-of-thought reasoning outputs
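
The extraction stage can be sketched as a prompt over retrieved passages; the `---` separator and prompt text below are arbitrary illustrative choices, not the paper's.

```python
from typing import Callable

EXTRACTOR_PROMPT = """\
You are an extraction agent. From the retrieved passages, quote only the
facts relevant to the sub-task, then state in one sentence what they imply.

Sub-task: {step}
Passages:
{passages}"""

def extract(step: str, passages: list[str],
            llm: Callable[[str], str]) -> str:
    # Passages come from whatever retriever backs the system.
    return llm(EXTRACTOR_PROMPT.format(
        step=step, passages="\n---\n".join(passages)))
```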

QA Agent

Receives the accumulated evidence and reasoning traces from all prior agents and synthesizes a final, grounded answer; performance here is highly sensitive to model capacity
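
A sketch of the QA stage under the same assumptions follows; restricting the agent to the reasoning trace mirrors the grounding described above, though the exact prompt is again illustrative.

```python
from typing import Callable

QA_PROMPT = """\
You are an answering agent. Using only the reasoning trace below, give a
short, direct answer. If the trace is insufficient, say what is missing.

Question: {question}
Reasoning trace:
{trace}

Final answer:"""

def answer(question: str, trace: list[str],
           llm: Callable[[str], str]) -> str:
    return llm(QA_PROMPT.format(question=question, trace="\n".join(trace)))
```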

Results

Benchmark | Best Existing RAG Baseline | MA-RAG (LLaMA3-70B / GPT-4o-mini) | Delta
HotpotQA (multi-hop) | Prior SOTA RAG | New SOTA | Significant improvement
2WikimQA (multi-hop) | Prior SOTA RAG | New SOTA | Significant improvement
NQ (open-domain) | Standalone LLM / RAG baselines | Outperforms all baselines | Positive across model scales
TriviaQA (open-domain) | Standalone LLM / RAG baselines | Outperforms all baselines | Positive across model scales
Medical QA (domain-specific) | Domain-fine-tuned models | Competitive without fine-tuning | No domain adaptation needed

Key Takeaways

  • Practitioners can achieve large-model-level QA performance with smaller models (e.g., LLaMA3-8B) by replacing monolithic RAG pipelines with a multi-agent architecture, reducing inference costs for production deployment
  • The ablation finding that the Planner and Extractor agents are the most critical components suggests that investment in query decomposition and precise evidence extraction yields the highest ROI when improving RAG systems
  • MA-RAG's zero-shot domain generalization (e.g., medical QA) demonstrates that well-structured agent orchestration with CoT can substitute for expensive domain-specific fine-tuning, making it highly practical for low-resource or rapidly evolving domains

Abstract

We present MA-RAG, a Multi-Agent framework for Retrieval-Augmented Generation (RAG) that addresses the inherent ambiguities and reasoning challenges in complex information-seeking tasks. Unlike conventional RAG methods that rely on end-to-end fine-tuning or isolated component enhancements, MA-RAG orchestrates a collaborative set of specialized AI agents: Planner, Step Definer, Extractor, and QA Agents, each responsible for a distinct stage of the RAG pipeline. By decomposing tasks into subtasks such as query disambiguation, evidence extraction, and answer synthesis, and enabling agents to communicate intermediate reasoning via chain-of-thought prompting, MA-RAG progressively refines retrieval and synthesis while maintaining modular interpretability. Extensive experiments on multi-hop and ambiguous QA benchmarks, including NQ, HotpotQA, 2WikimQA, and TriviaQA, demonstrate that MA-RAG significantly outperforms standalone LLMs and existing RAG methods across all model scales. Notably, even a small LLaMA3-8B model equipped with MA-RAG surpasses larger standalone LLMs, while larger variants (LLaMA3-70B and GPT-4o-mini) set new state-of-the-art results on challenging multi-hop datasets. Ablation studies reveal that both the planner and extractor agents are critical for multi-hop reasoning, and that high-capacity models are especially important for the QA agent to synthesize answers effectively. Beyond general-domain QA, MA-RAG generalizes to specialized domains such as medical QA, achieving competitive performance against domain-specific models without any domain-specific fine-tuning. Our results highlight the effectiveness of collaborative, modular reasoning in retrieval-augmented systems: MA-RAG not only improves answer accuracy and robustness but also provides interpretable intermediate reasoning steps, establishing a new paradigm for efficient and reliable multi-agent RAG.
