
Retrieval-Augmented Generation with Conflicting Evidence

Han Wang, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal
arXiv.org | 2025
RAG systems must simultaneously handle ambiguous queries, misinformation, and noisy documents rather than treating these as isolated problems. MADAM-RAG, a multi-agent debate framework, jointly addresses these conflicting evidence challenges to improve factuality and robustness.

Problem Statement

Existing RAG systems struggle when multiple retrieved documents contain conflicting, ambiguous, or misleading information, yet prior work addresses only one challenge at a time (e.g., ambiguity OR noise OR misinformation). Real-world retrieval scenarios routinely combine all three, leaving current baselines brittle—Llama3.3-70B-Instruct scores only 32.60 exact match on the proposed RAMDocs benchmark. There is no unified benchmark or method that evaluates and handles this combined challenge.

Key Novelty

  • RAMDocs dataset: a new benchmark simulating realistic multi-factor conflicting evidence scenarios combining ambiguity, misinformation, and noise in retrieved documents
  • MADAM-RAG: a multi-agent debate framework where LLM agents argue over answer merits across multiple rounds, with an aggregator that collates disambiguated answers while discarding misinformation and noise
  • Joint treatment of ambiguity, misinformation, and noise in a single unified RAG evaluation and methodology, exposing gaps that isolated benchmarks miss

Evaluation Highlights

  • On AmbigDocs (ambiguous query resolution), MADAM-RAG improves over strong RAG baselines by up to 11.40% absolute with both closed and open-source models
  • On FaithEval (misinformation suppression), MADAM-RAG improves by up to 15.80% absolute using Llama3.3-70B-Instruct

Breakthrough Assessment

6/10: The paper makes a solid, practically relevant contribution by unifying previously isolated RAG challenges into a single benchmark and multi-agent solution with meaningful performance gains, but the multi-agent debate paradigm is not entirely novel and a substantial performance gap remains on the hardest settings.

Methodology

  1. Construct RAMDocs by combining documents with varying levels of ambiguity (multiple valid answers for the same query), misinformation (documents with factually incorrect content), and noise (irrelevant documents), simulating realistic retrieval conditions
  2. Deploy multiple LLM agents, each receiving retrieved documents and tasked with proposing and defending answers across multiple debate rounds, exposing disagreements caused by conflicting evidence (see the sketch after this list)
  3. An aggregator LLM collates outputs from all debating agents, associating answers with disambiguated entities, discarding claims supported only by misinformation or noise, and producing the final response
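The following is a minimal sketch of the debate stage in steps 2–3, assuming a generic llm(prompt) -> str completion call and a one-agent-per-document setup; the prompt wording, round count, and function names are illustrative assumptions rather than the authors' exact implementation. The aggregation step is sketched separately under "Aggregator Agent" below.

```python
from typing import Callable, List

def debate(query: str, documents: List[str],
           llm: Callable[[str], str], rounds: int = 3) -> List[str]:
    """Multi-round debate over retrieved documents (illustrative sketch)."""
    # Round 1: each agent answers from its own document only.
    # Prompt wording is an assumption, not the paper's exact prompt.
    responses = [
        llm(f"Document:\n{doc}\n\nQuestion: {query}\n"
            "Answer using only this document and cite your evidence.")
        for doc in documents
    ]
    # Later rounds: each agent sees the others' answers and may defend or
    # revise its own, surfacing conflicts caused by ambiguity,
    # misinformation, and noise.
    for _ in range(rounds - 1):
        others = "\n".join(f"Agent {i}: {r}" for i, r in enumerate(responses))
        responses = [
            llm(f"Document:\n{doc}\n\nQuestion: {query}\n"
                f"Other agents said:\n{others}\n"
                "Defend or revise your answer, grounded in your document.")
            for doc in documents
        ]
    return responses  # passed to the aggregator for the final response
```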

System Components

RAMDocs

A benchmark dataset combining ambiguous queries, misinformation documents, and noisy/irrelevant documents to simulate complex real-world RAG scenarios for evaluation
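The released schema of RAMDocs is not reproduced here, so the record below is a hypothetical toy example meant only to illustrate how a single query can mix the three document types the benchmark combines; the field names, entities, and text are assumptions, not actual dataset content.

```python
# Hypothetical RAMDocs-style record (field names and content are illustrative
# assumptions, not the released schema): one ambiguous query, documents that
# support different valid answers, plus misinformation and noise documents.
example = {
    "query": "Where was John Smith born?",  # ambiguous: refers to two people
    "documents": [
        {"text": "John Smith, the novelist, was born in Leeds.",
         "type": "valid", "entity": "John Smith (novelist)"},
        {"text": "John Smith, the physicist, was born in Toronto.",
         "type": "valid", "entity": "John Smith (physicist)"},
        {"text": "John Smith was born on the Moon.",        # factually incorrect
         "type": "misinformation"},
        {"text": "The 2014 World Cup was held in Brazil.",  # irrelevant to the query
         "type": "noise"},
    ],
    # Ambiguous queries expect every valid, disambiguated answer.
    "answers": ["Leeds", "Toronto"],
}
```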

MADAM-RAG (Multi-Agent Debate)

Multiple LLM agents that iteratively debate the merits of candidate answers over several rounds, surfacing conflicting evidence and reinforcing correct answers through structured argumentation

Aggregator Agent

A coordinating LLM that synthesizes the debate outputs, collating valid answers for disambiguated entities while filtering out misinformation-supported and noise-supported claims
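A correspondingly minimal sketch of this aggregation step, again assuming a generic llm(prompt) -> str call; the prompt wording is an assumption intended to convey the role described above (collating answers per disambiguated entity, dropping misinformation- and noise-only claims), not the paper's actual prompt.

```python
from typing import Callable, List

def aggregate(query: str, agent_responses: List[str],
              llm: Callable[[str], str]) -> str:
    """Collate the final debate-round answers into one response (sketch)."""
    transcript = "\n".join(
        f"Agent {i}: {r}" for i, r in enumerate(agent_responses)
    )
    # Illustrative aggregator prompt (not the paper's exact wording): keep
    # answers tied to distinct disambiguated entities, discard claims
    # supported only by incorrect or irrelevant documents.
    return llm(
        f"Question: {query}\n"
        f"Agent answers:\n{transcript}\n\n"
        "If the question is ambiguous, list every valid answer together with "
        "the entity it refers to. Discard answers supported only by incorrect "
        "or irrelevant documents, then give the final answer."
    )
```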

Results

| Benchmark | Baseline | MADAM-RAG improvement |
|---|---|---|
| AmbigDocs (ambiguity) | Strong RAG baselines (closed and open-source models) | Up to +11.40% absolute |
| FaithEval (misinformation) | Llama3.3-70B-Instruct RAG baseline | Up to +15.80% absolute |
| RAMDocs (combined) | Llama3.3-70B-Instruct baseline: 32.60 EM | Partial improvement; substantial gap remains |

Key Takeaways

  • Production RAG pipelines should be evaluated against combined ambiguity + misinformation + noise scenarios, not just isolated challenges—RAMDocs provides a ready-made benchmark for this
  • Multi-agent debate is an effective architectural pattern for improving RAG robustness: having multiple agents argue over retrieved evidence and an aggregator filter the results outperforms single-agent RAG by meaningful margins
  • Even a state-of-the-art 70B model, Llama3.3-70B-Instruct, scores only 32.60 EM on RAMDocs, signaling that handling conflicting evidence in realistic retrieval settings remains an open research problem, especially when the balance between supporting evidence and misinformation is skewed

Abstract

Large language model (LLM) agents are increasingly employing retrieval-augmented generation (RAG) to improve the factuality of their responses. However, in practice, these systems often need to handle ambiguous user queries and potentially conflicting information from multiple sources while also suppressing inaccurate information from noisy or irrelevant documents. Prior work has generally studied and addressed these challenges in isolation, considering only one aspect at a time, such as handling ambiguity or robustness to noise and misinformation. We instead consider multiple factors simultaneously, proposing (i) RAMDocs (Retrieval with Ambiguity and Misinformation in Documents), a new dataset that simulates complex and realistic scenarios for conflicting evidence for a user query, including ambiguity, misinformation, and noise; and (ii) MADAM-RAG, a multi-agent approach in which LLM agents debate over the merits of an answer over multiple rounds, allowing an aggregator to collate responses corresponding to disambiguated entities while discarding misinformation and noise, thereby handling diverse sources of conflict jointly. We demonstrate the effectiveness of MADAM-RAG using both closed and open-source models on AmbigDocs -- which requires presenting all valid answers for ambiguous queries -- improving over strong RAG baselines by up to 11.40% and on FaithEval -- which requires suppressing misinformation -- where we improve by up to 15.80% (absolute) with Llama3.3-70B-Instruct. Furthermore, we find that RAMDocs poses a challenge for existing RAG baselines (Llama3.3-70B-Instruct only obtains 32.60 exact match score). While MADAM-RAG begins to address these conflicting factors, our analysis indicates that a substantial gap remains especially when increasing the level of imbalance in supporting evidence and misinformation.
