ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents
Problem Statement
Traditional RAG systems struggle with visually rich documents because they fail to effectively fuse textual and visual features during retrieval. Existing benchmarks focus on image-based QA rather than the full RAG pipeline (retrieval + comprehension + reasoning) over dense visual documents. Additionally, prior approaches allocate insufficient reasoning compute at test time, limiting their performance on complex multi-step queries.
Key Novelty
- ViDoSeek benchmark: A new dataset explicitly designed to evaluate end-to-end RAG performance on visually rich documents requiring complex reasoning, filling a gap left by image-QA-only benchmarks
- GMM-based hybrid multi-modal retrieval strategy that jointly models textual and visual feature distributions for more effective retrieval from visually rich documents
- Iterative multi-agent reasoning workflow (exploration, summarization, reflection) that scales test-time compute in the RAG domain, eliciting deeper reasoning from the model
Evaluation Highlights
- ViDoRAG outperforms existing methods by over 10% on the ViDoSeek benchmark, demonstrating both effectiveness and generalization
- The iterative agent workflow provides a framework for test-time scaling in RAG, showing that allocating more reasoning tokens improves performance on complex visual document QA
Methodology
- Step 1 – Multi-modal indexing: Documents are processed to extract both textual and visual features, which are modeled jointly using a Gaussian Mixture Model to create a hybrid retrieval index
- Step 2 – Hybrid retrieval: Given a query, the GMM-based strategy retrieves relevant document chunks by integrating scores from both textual and visual modalities, overcoming the limitations of purely visual or purely textual retrieval (see the retrieval sketch after this list)
- Step 3 – Iterative agent reasoning: A multi-agent workflow iterates through exploration (generating candidate answers), summarization (condensing retrieved evidence), and reflection (critiquing and refining answers), scaling test-time compute to improve final response quality
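As a rough illustration of Steps 1-2, the sketch below fits a small Gaussian mixture to per-page similarity scores in each modality (a simplification of modeling the full feature distributions), keeps the pages that fall in the high-score component, and merges the textual and visual picks. The two-component mixture, the union-based fusion rule, and all function names are assumptions made for illustration, not the authors' implementation.

```python
# Rough sketch of GMM-based dynamic retrieval cutoffs and hybrid fusion.
# All names and the fusion rule are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture


def dynamic_cutoff(scores: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Fit a 1-D GMM to query-page similarity scores and keep the pages
    assigned to the highest-mean ("relevant") component."""
    gmm = GaussianMixture(n_components=n_components, random_state=0)
    labels = gmm.fit_predict(scores.reshape(-1, 1))
    relevant = int(np.argmax(gmm.means_.ravel()))  # component with the highest mean
    return np.flatnonzero(labels == relevant)      # page indices to retain


def hybrid_retrieve(text_scores: np.ndarray, visual_scores: np.ndarray) -> list[int]:
    """Merge pages kept by the textual and visual retrievers (assumed fusion rule)."""
    kept = set(dynamic_cutoff(text_scores)) | set(dynamic_cutoff(visual_scores))
    return sorted(int(i) for i in kept)


# Usage: text_scores / visual_scores are per-page similarities between the query
# and each page, e.g. embedding dot products from a text and a vision encoder.
# pages = hybrid_retrieve(text_scores, visual_scores)
```

Retaining the high-score mixture component lets the number of retrieved pages adapt to each query rather than relying on a fixed top-k.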
System Components
- ViDoSeek dataset: A novel dataset of visually rich documents with complex reasoning questions, designed to evaluate the full RAG pipeline, including retrieval quality, comprehension, and multi-step reasoning
- GMM-based hybrid retriever: Uses Gaussian Mixture Models to jointly capture the distribution of textual and visual features, enabling more accurate retrieval from multi-modal document collections than unimodal methods
- Exploration agent: Generates initial candidate answers and retrieves relevant document regions by broadly searching the visual document space
- Summarization agent: Condenses and organizes the retrieved evidence from multiple document chunks into coherent intermediate representations
- Reflection agent: Critiques the current answer hypothesis, identifies gaps or errors, and triggers further retrieval or reasoning iterations to refine the final output (see the loop sketch below)
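To make the exploration, summarization, and reflection roles concrete, the minimal sketch below wires them into a single loop around generic `retrieve` and `llm` callables. The prompts, the ACCEPT-based stopping rule, and the iteration cap are assumptions for illustration and are not taken from the paper.

```python
# Minimal explore-summarize-reflect loop; prompts, the ACCEPT stopping rule,
# and the `retrieve`/`llm` callables are illustrative assumptions.
from typing import Callable


def iterative_answer(
    query: str,
    retrieve: Callable[[str], list[str]],  # returns candidate page contents
    llm: Callable[[str], str],             # returns a model completion
    max_iters: int = 3,
) -> str:
    notes = ""   # running summary of evidence gathered so far
    answer = ""
    for _ in range(max_iters):
        # Exploration: search broadly, optionally conditioned on current notes.
        search_query = query if not notes else f"{query}\nKnown so far: {notes}"
        pages = retrieve(search_query)
        answer = llm(f"Question: {query}\nEvidence: {pages}\nDraft an answer.")
        # Summarization: condense retrieved evidence into working notes.
        notes = llm(f"Summarize the evidence relevant to: {query}\nEvidence: {pages}")
        # Reflection: critique the draft and decide whether to iterate again.
        verdict = llm(
            f"Question: {query}\nNotes: {notes}\nDraft answer: {answer}\n"
            "Reply ACCEPT if the draft fully answers the question; "
            "otherwise list what is missing."
        )
        if verdict.strip().upper().startswith("ACCEPT"):
            break
    return answer
```

Capping the number of iterations bounds the extra test-time compute while still letting the reflection step trigger further retrieval when the draft is incomplete.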
Results
| Metric/Benchmark | Comparison Baseline | ViDoRAG | Delta |
|---|---|---|---|
| ViDoSeek accuracy | Baseline RAG methods | >10% higher accuracy | >10% |
| Multi-modal retrieval quality | Purely visual retrieval | GMM-based hybrid retrieval | Qualitative gain |
| Complex reasoning performance | Single-pass RAG inference | Iterative agent workflow | Improved via test-time scaling |
Key Takeaways
- For multi-modal RAG systems, fusing textual and visual retrieval signals via a probabilistic model (e.g., GMM) substantially outperforms relying on either modality alone — practitioners should move beyond purely visual or purely text-based retrieval for document-heavy applications
- Test-time compute scaling (allocating more reasoning tokens through iterative explore-summarize-reflect loops) is a viable and impactful strategy in RAG settings, not just in standalone LLM inference — this suggests architectural patterns for building more robust document QA pipelines
- Benchmark design matters: evaluating RAG on image-QA tasks alone is insufficient; practitioners building production RAG systems over PDFs, reports, or presentations should use benchmarks that test the full retrieval-to-reasoning pipeline on dense visual documents
Abstract
Understanding information from visually rich documents remains a significant challenge for traditional Retrieval-Augmented Generation (RAG) methods. Existing benchmarks predominantly focus on image-based question answering (QA), overlooking the fundamental challenges of efficient retrieval, comprehension, and reasoning within dense visual documents. To bridge this gap, we introduce ViDoSeek, a novel dataset designed to evaluate RAG performance on visually rich documents requiring complex reasoning. Based on it, we identify key limitations in current RAG approaches: (i) purely visual retrieval methods struggle to effectively integrate both textual and visual features, and (ii) previous approaches often allocate insufficient reasoning tokens, limiting their effectiveness. To address these challenges, we propose ViDoRAG, a novel multi-agent RAG framework tailored for complex reasoning across visual documents. ViDoRAG employs a Gaussian Mixture Model (GMM)-based hybrid strategy to effectively handle multi-modal retrieval. To further elicit the model's reasoning capabilities, we introduce an iterative agent workflow incorporating exploration, summarization, and reflection, providing a framework for investigating test-time scaling in RAG domains. Extensive experiments on ViDoSeek validate the effectiveness and generalization of our approach. Notably, ViDoRAG outperforms existing methods by over 10% on the competitive ViDoSeek benchmark. The code is available at https://github.com/Alibaba-NLP/ViDoRAG.