Retrieval-Augmented Generation with Conflicting Evidence
Problem Statement
Existing RAG systems struggle when multiple retrieved documents contain conflicting, ambiguous, or misleading information, yet prior work addresses only one challenge at a time (e.g., ambiguity, noise, or misinformation). Real-world retrieval scenarios routinely combine all three, leaving current baselines brittle: Llama3.3-70B-Instruct scores only 32.60 exact match (EM) on the proposed RAMDocs benchmark. No existing benchmark or method evaluates and handles this combined challenge in a unified way.
Key Novelty
- RAMDocs dataset: a new benchmark simulating realistic multi-factor conflict scenarios that combine ambiguity, misinformation, and noise in the retrieved documents (an illustrative instance is sketched after this list)
- MADAM-RAG: a multi-agent debate framework where LLM agents argue over answer merits across multiple rounds, with an aggregator that collates disambiguated answers while discarding misinformation and noise
- Joint treatment of ambiguity, misinformation, and noise in a single unified RAG evaluation and methodology, exposing gaps that isolated benchmarks miss
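To make the combined setting concrete, here is a minimal sketch of what a single RAMDocs-style instance could look like. The field names (`query`, `valid_answers`, `documents`, `label`) and the example itself are illustrative assumptions, not the dataset's actual schema or contents.

```python
# Hypothetical RAMDocs-style instance: an ambiguous query with two valid answers,
# one misinformation document, and one irrelevant (noise) document.
# Field names are assumptions for illustration, not the real schema.
example_instance = {
    "query": "What is the capital of Georgia?",  # ambiguous: U.S. state vs. country
    "valid_answers": ["Atlanta", "Tbilisi"],
    "documents": [
        {"text": "Atlanta is the capital of the U.S. state of Georgia.",
         "label": "supports_valid_answer"},
        {"text": "Tbilisi is the capital of the country of Georgia.",
         "label": "supports_valid_answer"},
        {"text": "Savannah is the capital of Georgia.",
         "label": "misinformation"},
        {"text": "Peaches are one of Georgia's best-known crops.",
         "label": "noise"},
    ],
}
```

A robust system should return both Atlanta and Tbilisi while ignoring the misinformation and noise documents.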
Evaluation Highlights
- On AmbigDocs (ambiguous query resolution), MADAM-RAG improves over strong RAG baselines by up to 11.40% absolute with both closed and open-source models
- On FaithEval (misinformation suppression), MADAM-RAG improves by up to 15.80% absolute using Llama3.3-70B-Instruct
Breakthrough Assessment
The main advance is the joint treatment of ambiguity, misinformation, and noise: MADAM-RAG delivers sizable gains on single-factor benchmarks, but the low scores on RAMDocs show that the combined setting remains largely unsolved.
Methodology
- Construct RAMDocs by combining documents with varying levels of ambiguity (multiple valid answers for the same query), misinformation (documents with factually incorrect content), and noise (irrelevant documents), simulating realistic retrieval conditions
- Deploy multiple LLM agents, each receiving retrieved documents and tasked with proposing and defending answers across multiple debate rounds, exposing disagreements caused by conflicting evidence
- An aggregator LLM collates the outputs of all debating agents, associating answers with disambiguated entities, discarding claims supported only by misinformation or noise, and producing the final response (a minimal sketch of this debate-and-aggregate loop follows)
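A minimal sketch of the debate-and-aggregate loop is shown below, assuming a generic `llm(prompt) -> str` completion function. The prompts, the round count, and the one-document-per-agent assignment are assumptions made for illustration rather than the paper's exact implementation.

```python
def madam_rag_sketch(query, documents, llm, num_rounds=3):
    """Rough sketch of multi-agent debate over retrieved documents (illustrative only)."""
    # Round 1: each agent answers from its own document
    # (one document per agent is an assumption of this sketch).
    answers = [
        llm(
            f"Question: {query}\nDocument: {doc}\n"
            "Answer using only this document, and say if it looks "
            "irrelevant or unreliable."
        )
        for doc in documents
    ]

    # Later rounds: each agent sees the other agents' latest answers and may revise.
    for _ in range(num_rounds - 1):
        answers = [
            llm(
                f"Question: {query}\nYour document: {documents[i]}\n"
                f"Other agents answered: {[a for j, a in enumerate(answers) if j != i]}\n"
                "Defend, revise, or withdraw your answer."
            )
            for i in range(len(documents))
        ]

    # Aggregation: collate answers per disambiguated interpretation and
    # drop claims supported only by misinformation or noise.
    return llm(
        f"Question: {query}\nAgent responses: {answers}\n"
        "Report every answer that is valid for a distinct interpretation of "
        "the question; discard answers backed only by unreliable or "
        "irrelevant documents."
    )
```

Any chat or completion API can be plugged in as `llm`, for example by running it over the document texts of the RAMDocs-style instance sketched earlier.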
System Components
- RAMDocs benchmark: a dataset combining ambiguous queries, misinformation documents, and noisy/irrelevant documents to simulate complex real-world RAG scenarios for evaluation
- Debating agents: multiple LLM agents that iteratively debate the merits of candidate answers over several rounds, surfacing conflicting evidence and reinforcing correct answers through structured argumentation
- Aggregator: a coordinating LLM that synthesizes the debate outputs, collating valid answers for disambiguated entities while filtering out claims supported only by misinformation or noise (a hypothetical example of such aggregated output is shown below)
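For intuition, the aggregated output for the hypothetical "capital of Georgia" instance above might be organized as follows; the structure is an illustration of the intended behavior, not the paper's actual output format.

```python
# Hypothetical aggregator output: valid answers grouped by disambiguated entity,
# with misinformation- and noise-supported claims discarded.
aggregated_output = {
    "Georgia (U.S. state)": {"answer": "Atlanta", "supporting_agents": [0]},
    "Georgia (country)": {"answer": "Tbilisi", "supporting_agents": [1]},
    "discarded": [
        {"claim": "Savannah is the capital of Georgia",
         "reason": "contradicted by other documents (misinformation)"},
        {"claim": None,
         "reason": "document irrelevant to the query (noise)"},
    ],
}
```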
Results
| Benchmark | Challenge | Baseline | MADAM-RAG result |
|---|---|---|---|
| AmbigDocs | Ambiguity | Strong RAG baselines (closed and open-source models) | Up to +11.40% absolute improvement |
| FaithEval | Misinformation | Llama3.3-70B-Instruct RAG baseline | Up to +15.80% absolute improvement |
| RAMDocs | Ambiguity + misinformation + noise | Llama3.3-70B-Instruct: 32.60 EM | Partial improvement; substantial gap remains |
Key Takeaways
- Production RAG pipelines should be evaluated against combined ambiguity + misinformation + noise scenarios, not just isolated challenges; RAMDocs provides a ready-made benchmark for this (a minimal exact-match evaluation loop is sketched after this list)
- Multi-agent debate is an effective architectural pattern for improving RAG robustness: having multiple agents argue over retrieved evidence and an aggregator filter the results outperforms single-agent RAG by meaningful margins
- Even a state-of-the-art 70B model (Llama3.3-70B-Instruct) scores only 32.60 EM on RAMDocs, signaling that conflicting evidence in realistic retrieval settings remains an open and important research problem, especially under imbalanced ratios of supporting evidence to misinformation
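As a starting point for the evaluation recommendation above, a minimal exact-match harness could look like the sketch below. It reuses the illustrative field names from the earlier RAMDocs-style instance and assumes a scoring convention in which a prediction must mention every valid answer; the benchmark's official format and metric may differ.

```python
import json
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace (common EM normalization)."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def covers_all_answers(prediction, valid_answers):
    """1.0 if every valid answer appears in the prediction (assumed convention)."""
    pred = normalize(prediction)
    return float(all(normalize(a) in pred for a in valid_answers))


def evaluate(jsonl_path, rag_pipeline):
    """Average score of `rag_pipeline(query, documents) -> str` over a JSONL file
    whose rows follow the illustrative schema sketched earlier."""
    scores = []
    with open(jsonl_path) as f:
        for line in f:
            ex = json.loads(line)
            pred = rag_pipeline(ex["query"], ex["documents"])
            scores.append(covers_all_answers(pred, ex["valid_answers"]))
    return 100.0 * sum(scores) / max(len(scores), 1)
```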
Abstract
Large language model (LLM) agents are increasingly employing retrieval-augmented generation (RAG) to improve the factuality of their responses. However, in practice, these systems often need to handle ambiguous user queries and potentially conflicting information from multiple sources while also suppressing inaccurate information from noisy or irrelevant documents. Prior work has generally studied and addressed these challenges in isolation, considering only one aspect at a time, such as handling ambiguity or robustness to noise and misinformation. We instead consider multiple factors simultaneously, proposing (i) RAMDocs (Retrieval with Ambiguity and Misinformation in Documents), a new dataset that simulates complex and realistic scenarios for conflicting evidence for a user query, including ambiguity, misinformation, and noise; and (ii) MADAM-RAG, a multi-agent approach in which LLM agents debate over the merits of an answer over multiple rounds, allowing an aggregator to collate responses corresponding to disambiguated entities while discarding misinformation and noise, thereby handling diverse sources of conflict jointly. We demonstrate the effectiveness of MADAM-RAG using both closed and open-source models on AmbigDocs -- which requires presenting all valid answers for ambiguous queries -- improving over strong RAG baselines by up to 11.40% and on FaithEval -- which requires suppressing misinformation -- where we improve by up to 15.80% (absolute) with Llama3.3-70B-Instruct. Furthermore, we find that RAMDocs poses a challenge for existing RAG baselines (Llama3.3-70B-Instruct only obtains 32.60 exact match score). While MADAM-RAG begins to address these conflicting factors, our analysis indicates that a substantial gap remains especially when increasing the level of imbalance in supporting evidence and misinformation.