ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents
Problem Statement
Traditional RAG systems struggle with visually rich documents because they fail to effectively fuse textual and visual features during retrieval. Existing benchmarks focus on image-based QA rather than the full RAG pipeline (retrieval + comprehension + reasoning) over dense visual documents. Additionally, prior approaches allocate insufficient reasoning compute at test time, limiting their performance on complex multi-step queries.
Key Novelty
- ViDoSeek benchmark: A new dataset explicitly designed to evaluate end-to-end RAG performance on visually rich documents requiring complex reasoning, filling a gap left by image-QA-only benchmarks
- GMM-based hybrid multi-modal retrieval strategy that jointly models textual and visual feature distributions for more effective retrieval from visually rich documents
- Iterative multi-agent reasoning workflow (exploration, summarization, reflection) that scales test-time compute in the RAG domain, eliciting deeper reasoning from the model
Evaluation Highlights
- ViDoRAG outperforms existing methods by over 10% on the ViDoSeek benchmark, demonstrating both effectiveness and generalization
- The iterative agent workflow provides a framework for test-time scaling in RAG, showing that allocating more reasoning tokens improves performance on complex visual document QA
Methodology
- Step 1 – Multi-modal indexing: Documents are processed to extract both textual and visual features, which are modeled jointly using a Gaussian Mixture Model to create a hybrid retrieval index
- Step 2 – Hybrid retrieval: Given a query, the GMM-based strategy retrieves relevant document chunks by integrating scores from both textual and visual modalities, overcoming the limitations of purely visual or purely textual retrieval (see the retrieval sketch after this list)
- Step 3 – Iterative agent reasoning: A multi-agent workflow iterates through exploration (generating candidate answers), summarization (condensing retrieved evidence), and reflection (critiquing and refining answers), scaling test-time compute to improve final response quality
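As a rough illustration of Steps 1-2, the sketch below fits a small Gaussian mixture to per-page similarity scores in each modality (a simplification of modeling the full feature distributions), keeps the pages that fall in the high-score component, and merges the textual and visual picks. The two-component mixture, the union-based fusion rule, and all function names are assumptions made for illustration, not the authors' implementation.

```python
# Rough sketch of GMM-based dynamic retrieval cutoffs and hybrid fusion.
# All names and the fusion rule are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture


def dynamic_cutoff(scores: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Fit a 1-D GMM to query-page similarity scores and keep the pages
    assigned to the highest-mean ("relevant") component."""
    gmm = GaussianMixture(n_components=n_components, random_state=0)
    labels = gmm.fit_predict(scores.reshape(-1, 1))
    relevant = int(np.argmax(gmm.means_.ravel()))  # component with the highest mean
    return np.flatnonzero(labels == relevant)      # page indices to retain


def hybrid_retrieve(text_scores: np.ndarray, visual_scores: np.ndarray) -> list[int]:
    """Merge pages kept by the textual and visual retrievers (assumed fusion rule)."""
    kept = set(dynamic_cutoff(text_scores)) | set(dynamic_cutoff(visual_scores))
    return sorted(int(i) for i in kept)


# Usage: text_scores / visual_scores are per-page similarities between the query
# and each page, e.g. embedding dot products from a text and a vision encoder.
# pages = hybrid_retrieve(text_scores, visual_scores)
```

Retaining the high-score mixture component lets the number of retrieved pages adapt to each query rather than relying on a fixed top-k.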
System Components
- ViDoSeek dataset: A novel dataset of visually rich documents with complex reasoning questions, designed to evaluate the full RAG pipeline, including retrieval quality, comprehension, and multi-step reasoning
- GMM-based hybrid retriever: Uses Gaussian Mixture Models to jointly capture the distribution of textual and visual features, enabling more accurate retrieval from multi-modal document collections than unimodal methods
- Exploration agent: Generates initial candidate answers and retrieves relevant document regions by broadly searching the visual document space
- Summarization agent: Condenses and organizes the retrieved evidence from multiple document chunks into coherent intermediate representations
- Reflection agent: Critiques the current answer hypothesis, identifies gaps or errors, and triggers further retrieval or reasoning iterations to refine the final output (see the loop sketch below)
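To make the exploration, summarization, and reflection roles concrete, the minimal sketch below wires them into a single loop around generic `retrieve` and `llm` callables. The prompts, the ACCEPT-based stopping rule, and the iteration cap are assumptions for illustration and are not taken from the paper.

```python
# Minimal explore-summarize-reflect loop; prompts, the ACCEPT stopping rule,
# and the `retrieve`/`llm` callables are illustrative assumptions.
from typing import Callable


def iterative_answer(
    query: str,
    retrieve: Callable[[str], list[str]],  # returns candidate page contents
    llm: Callable[[str], str],             # returns a model completion
    max_iters: int = 3,
) -> str:
    notes = ""   # running summary of evidence gathered so far
    answer = ""
    for _ in range(max_iters):
        # Exploration: search broadly, optionally conditioned on current notes.
        search_query = query if not notes else f"{query}\nKnown so far: {notes}"
        pages = retrieve(search_query)
        answer = llm(f"Question: {query}\nEvidence: {pages}\nDraft an answer.")
        # Summarization: condense retrieved evidence into working notes.
        notes = llm(f"Summarize the evidence relevant to: {query}\nEvidence: {pages}")
        # Reflection: critique the draft and decide whether to iterate again.
        verdict = llm(
            f"Question: {query}\nNotes: {notes}\nDraft answer: {answer}\n"
            "Reply ACCEPT if the draft fully answers the question; "
            "otherwise list what is missing."
        )
        if verdict.strip().upper().startswith("ACCEPT"):
            break
    return answer
```

Capping the number of iterations bounds the extra test-time compute while still letting the reflection step trigger further retrieval when the draft is incomplete.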
Results
| Metric/Benchmark | Comparison Baseline | ViDoRAG | Delta |
|---|---|---|---|
| ViDoSeek accuracy | Baseline RAG methods | >10% higher accuracy | >10% |
| Multi-modal retrieval quality | Purely visual retrieval | GMM-based hybrid retrieval | Qualitative gain |
| Complex reasoning performance | Single-pass RAG inference | Iterative agent workflow | Improved via test-time scaling |
Key Takeaways
- For multi-modal RAG systems, fusing textual and visual retrieval signals via a probabilistic model (e.g., GMM) substantially outperforms relying on either modality alone — practitioners should move beyond purely visual or purely text-based retrieval for document-heavy applications
- Test-time compute scaling (allocating more reasoning tokens through iterative explore-summarize-reflect loops) is a viable and impactful strategy in RAG settings, not just in standalone LLM inference — this suggests architectural patterns for building more robust document QA pipelines
- Benchmark design matters: evaluating RAG on image-QA tasks alone is insufficient; practitioners building production RAG systems over PDFs, reports, or presentations should use benchmarks that test the full retrieval-to-reasoning pipeline on dense visual documents
Abstract
Understanding information from visually rich documents remains a significant challenge for traditional Retrieval-Augmented Generation (RAG) methods. Existing benchmarks predominantly focus on image-based question answering (QA), overlooking the fundamental challenges of efficient retrieval, comprehension, and reasoning within dense visual documents. To bridge this gap, we introduce ViDoSeek, a novel dataset designed to evaluate RAG performance on visually rich documents requiring complex reasoning. Based on it, we identify key limitations in current RAG approaches: (i) purely visual retrieval methods struggle to effectively integrate both textual and visual features, and (ii) previous approaches often allocate insufficient reasoning tokens, limiting their effectiveness. To address these challenges, we propose ViDoRAG, a novel multi-agent RAG framework tailored for complex reasoning across visual documents. ViDoRAG employs a Gaussian Mixture Model (GMM)-based hybrid strategy to effectively handle multi-modal retrieval. To further elicit the model's reasoning capabilities, we introduce an iterative agent workflow incorporating exploration, summarization, and reflection, providing a framework for investigating test-time scaling in RAG domains. Extensive experiments on ViDoSeek validate the effectiveness and generalization of our approach. Notably, ViDoRAG outperforms existing methods by over 10% on the competitive ViDoSeek benchmark. The code is available at https://github.com/Alibaba-NLP/ViDoRAG.