
A Survey of Multimodal Retrieval-Augmented Generation

Lang Mei, Siyu Mo, Zhihan Yang, Chong Chen
arXiv.org | 2025
Multimodal Retrieval-Augmented Generation (MRAG) extends traditional text-only RAG by integrating multimodal data (text, images, videos) into both retrieval and generation pipelines, enabling more grounded and accurate responses. This survey provides a comprehensive review of MRAG components, datasets, evaluation methods, challenges, and future directions.

Problem Statement

Traditional RAG systems are limited to textual knowledge retrieval, which fails in tasks requiring visual or cross-modal understanding and leads to hallucinations in multimodal contexts. LLMs lack mechanisms to ground responses in diverse, real-world data formats beyond text. MRAG addresses these gaps by enabling retrieval and generation across heterogeneous modalities, improving factual accuracy and contextual relevance.

Key Novelty

  • First comprehensive survey systematically organizing the MRAG landscape including retrieval, fusion, and generation components across modalities
  • Structured taxonomy of MRAG datasets and evaluation benchmarks, providing a reference map for practitioners building multimodal QA and retrieval systems
  • Identification of open challenges and future research directions specific to MRAG, including cross-modal alignment, scalability, and hallucination mitigation

Evaluation Highlights

  • MRAG systems qualitatively outperform text-only RAG in visual question answering and multimodal knowledge-grounded tasks
  • The survey consolidates evidence across recent studies showing reduced hallucination rates and improved response accuracy when visual context is retrieved alongside text

Breakthrough Assessment

5/10. As a survey paper, this work makes a solid organizational and reference contribution to the field by consolidating fragmented MRAG research, but it does not introduce a novel method or empirical breakthrough; its value lies in accessibility and roadmap-setting for practitioners.

Methodology

  1. Systematically review and categorize existing MRAG literature, organizing works by retrieval modality (text, image, video), fusion strategy, and generation approach
  2. Analyze and compare publicly available MRAG datasets and evaluation benchmarks, noting their coverage, scale, and suitability for different task types
  3. Synthesize findings to identify current limitations, common failure modes (e.g., cross-modal misalignment, hallucinations), and promising future research directions

System Components

Multimodal Retriever

Retrieves relevant text, image, or video chunks from external knowledge bases using embedding-based or cross-modal similarity search
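The survey treats the retriever abstractly; as one concrete illustration (not a method prescribed by the paper), the sketch below ranks candidate images against a text query in a CLIP joint embedding space. The HuggingFace checkpoint `openai/clip-vit-base-patch32` and the simple in-memory list standing in for a knowledge base are assumptions for illustration.

```python
# Minimal sketch of embedding-based cross-modal retrieval (illustrative, not the
# survey's method): a text query is matched against an image corpus in CLIP's
# joint embedding space and candidates are ranked by cosine similarity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)   # (N, d) unit vectors

def retrieve(query, paths, k=3):
    image_index = embed_images(paths)                      # corpus embeddings
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        q = model.get_text_features(**inputs)
    q = torch.nn.functional.normalize(q, dim=-1)
    scores = (q @ image_index.T).squeeze(0)                # cosine similarities
    top = scores.topk(min(k, len(paths)))
    return [(paths[i], scores[i].item()) for i in top.indices]

# Example call (paths are placeholders):
# retrieve("a diagram of a transformer encoder", ["fig1.png", "fig2.png"])
```

In a production pipeline the in-memory corpus would typically be replaced by an approximate nearest-neighbor index, but the ranking logic stays the same.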

Cross-Modal Fusion Module

Integrates retrieved multimodal evidence with the query context before or during generation, aligning representations across modalities
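The survey describes fusion in general terms; the sketch below is one assumed "before generation" (early-fusion) variant that simply interleaves retrieved passages and image references with the query into a single prompt structure. The `ImageEvidence` and `MultimodalPrompt` containers are hypothetical, introduced only for illustration.

```python
# Minimal sketch of prompt-level early fusion (an assumption, not the survey's
# prescribed method): retrieved text passages and image references are interleaved
# with the user query into one multimodal prompt for the downstream generator.
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class ImageEvidence:
    path: str          # local path or URL of a retrieved image
    caption: str = ""  # optional caption recovered at retrieval time

@dataclass
class MultimodalPrompt:
    segments: List[Union[str, ImageEvidence]] = field(default_factory=list)

def fuse(query: str, passages: List[str], images: List[ImageEvidence]) -> MultimodalPrompt:
    prompt = MultimodalPrompt()
    prompt.segments.append("Answer the question using the evidence below.")
    for i, passage in enumerate(passages):
        prompt.segments.append(f"[Text evidence {i + 1}] {passage}")
    for img in images:
        prompt.segments.append(img)                     # image slot for the generator
        if img.caption:
            prompt.segments.append(f"(caption: {img.caption})")
    prompt.segments.append(f"Question: {query}")
    return prompt
```

Fusion "during generation" (e.g., combining modality-specific embeddings inside the model) is the other family the component description points to; the prompt-level approach here is only the simplest option.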

Multimodal Generator

A large language or vision-language model that conditions its response on both the original query and the retrieved multimodal context to produce grounded outputs
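As a minimal sketch of this step, the code below conditions a LLaVA-style vision-language model, loaded through HuggingFace transformers, on the query, retrieved passages, and one retrieved image. The checkpoint name and prompt template follow common LLaVA usage and are assumptions; the survey does not prescribe a specific generator.

```python
# Minimal sketch of grounded generation with an assumed LLaVA-style checkpoint;
# any vision-language model with an image-conditioned generate interface would do.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"   # illustrative choice
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def generate_grounded_answer(query: str, passages: list[str], image: Image.Image) -> str:
    # Fold retrieved text evidence into the prompt; the retrieved image is passed
    # through the processor so the model can attend to it directly.
    evidence = "\n".join(f"- {p}" for p in passages)
    prompt = f"USER: <image>\nEvidence:\n{evidence}\nQuestion: {query} ASSISTANT:"
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=128)
    return processor.decode(output[0], skip_special_tokens=True)
```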

Evaluation Benchmarks

Datasets and metrics (e.g., multimodal QA benchmarks) used to assess retrieval precision, answer accuracy, and hallucination rates in MRAG systems
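Benchmarks differ in their exact metric definitions, so the sketch below gives only generic illustrations of two commonly reported quantities: retrieval precision@k and exact-match answer accuracy. Neither formula is taken from the survey.

```python
# Minimal sketch of two metrics commonly reported on MRAG benchmarks; exact
# definitions vary by dataset, so these are generic illustrations only.

def precision_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the top-k retrieved items that are relevant."""
    top_k = retrieved_ids[:k]
    relevant = set(relevant_ids)
    return sum(1 for doc_id in top_k if doc_id in relevant) / max(len(top_k), 1)

def exact_match(prediction: str, references: list[str]) -> float:
    """1.0 if the normalized prediction matches any reference answer, else 0.0."""
    norm = lambda s: " ".join(s.lower().strip().split())
    return float(any(norm(prediction) == norm(ref) for ref in references))

# Example:
# precision_at_k(["img3", "doc1", "img9"], ["doc1"], k=3)   -> 0.333...
# exact_match("The Eiffel Tower", ["the eiffel tower"])      -> 1.0
```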

Results

Aspect               | Text-only RAG                 | MRAG                                | Delta
Visual QA Accuracy   | Limited (no visual grounding) | Improved (visual context retrieved) | Qualitative gain
Hallucination Rate   | Higher in multimodal tasks    | Reduced via multimodal grounding    | Qualitative reduction
Modality Coverage    | Text only                     | Text + Image + Video                | Expanded scope
Contextual Relevance | Moderate                      | Higher in vision-language tasks     | Qualitative improvement

Key Takeaways

  • Practitioners building QA or knowledge-grounded systems should consider MRAG pipelines when tasks involve visual content, as text-only RAG systematically underperforms in such scenarios
  • Cross-modal alignment between retrieved visual and textual evidence remains the key technical bottleneck; selecting or fine-tuning encoders with strong joint embedding spaces (e.g., CLIP-based) is critical, as illustrated in the alignment check sketched after this list
  • This survey serves as a practical entry point for researchers to identify relevant datasets and benchmarks before designing MRAG experiments, saving significant literature search overhead
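
Expanding on the second takeaway, the sketch below is one assumed way to sanity-check a candidate encoder's joint embedding space before building an MRAG pipeline around it: measure text-to-image recall@1 on held-out paired data and treat low recall as a warning sign. The paired-embedding setup and the recall@1 criterion are illustrative choices, not recommendations from the survey.

```python
# Minimal sketch of a cross-modal alignment sanity check: given paired text/image
# embeddings from a candidate encoder, measure how often each caption's nearest
# image (by cosine similarity) is its own pair. Low recall@1 signals a weak joint
# embedding space for retrieval.
import numpy as np

def text_to_image_recall_at_1(text_emb: np.ndarray, image_emb: np.ndarray) -> float:
    """text_emb, image_emb: (N, d) arrays where row i of each forms a matched pair."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sims = t @ v.T                          # (N, N) cosine similarity matrix
    nearest = sims.argmax(axis=1)           # index of the closest image per caption
    return float((nearest == np.arange(len(t))).mean())
```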

Abstract

Multimodal Retrieval-Augmented Generation (MRAG) enhances large language models (LLMs) by integrating multimodal data (text, images, videos) into retrieval and generation processes, overcoming the limitations of text-only Retrieval-Augmented Generation (RAG). While RAG improves response accuracy by incorporating external textual knowledge, MRAG extends this framework to include multimodal retrieval and generation, leveraging contextual information from diverse data types. This approach reduces hallucinations and enhances question-answering systems by grounding responses in factual, multimodal knowledge. Recent studies show MRAG outperforms traditional RAG, especially in scenarios requiring both visual and textual understanding. This survey reviews MRAG's essential components, datasets, evaluation methods, and limitations, providing insights into its construction and improvement. It also identifies challenges and future research directions, highlighting MRAG's potential to revolutionize multimodal information retrieval and generation. By offering a comprehensive perspective, this work encourages further exploration into this promising paradigm.
