VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents
Problem Statement
Conventional text-based RAG systems fail on visually-rich documents because parsing PDFs/PPTXs to text loses critical information embedded in charts, tables, and layouts. Existing document QA benchmarks are fragmented across closed-domain settings, lacking a unified resource for open-domain evaluation. There is no established framework for adapting large vision-language models to perform dense retrieval over heterogeneous document corpora.
Key Novelty
- VDocRAG framework: end-to-end RAG pipeline that ingests documents as unified image representations, preserving visual modalities without lossy text conversion
- Novel self-supervised pre-training tasks that compress visual document information into dense token embeddings aligned with textual content, adapting VLMs for retrieval (a minimal retrieval sketch follows this list)
- OpenDocVQA: the first unified open-domain document visual question answering benchmark collection covering diverse document types, formats, and modalities for training and evaluation
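To make the dense-retrieval idea above concrete, here is a minimal sketch of scoring a question against document page images with dense embeddings. It uses an off-the-shelf CLIP checkpoint from Hugging Face transformers purely as a stand-in encoder, and the pooling, normalization, and file names are illustrative assumptions; VDocRAG adapts a large VLM with its own pre-training tasks, which this sketch does not reproduce.

```python
# Minimal dense-retrieval sketch over document page images.
# CLIP is only a stand-in encoder; VDocRAG adapts a large VLM with its own
# pre-training tasks, which this sketch does not reproduce.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_pages(image_paths):
    """Encode rendered page images into L2-normalized dense vectors."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def embed_query(question):
    """Encode the question into an L2-normalized dense vector."""
    inputs = processor(text=[question], return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

# Hypothetical page images rendered from a report and a slide deck.
page_vecs = embed_pages(["report_p1.png", "slides_p3.png"])
query_vec = embed_query("What was Q3 revenue according to the bar chart?")
scores = query_vec @ page_vecs.T          # cosine similarity (vectors are normalized)
top_page = scores.argmax(dim=-1).item()   # index of the best-matching page image
```

Swapping the stand-in encoder for a retrieval-adapted VLM changes the embedding quality, not the scoring logic: retrieval stays a nearest-neighbor search over page-image vectors.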
Evaluation Highlights
- VDocRAG substantially outperforms conventional text-based RAG baselines across diverse document types including those with charts, tables, and mixed-modality content
- Strong generalization capability demonstrated across varied document formats (PDF, PPTX) and question types in the OpenDocVQA benchmark
Breakthrough Assessment
Methodology
- Step 1 — Document Ingestion: Convert all documents (PDF, PPTX, etc.) into a unified image format to preserve visual structure, layout, charts, and tables without lossy text parsing (a minimal rendering sketch follows this list)
- Step 2 — Visual Retriever Pre-training: Apply novel self-supervised tasks on large VLMs to compress page-level visual information into dense token representations aligned with in-document textual content, enabling effective similarity-based retrieval
- Step 3 — RAG Pipeline Execution: At inference, retrieve the most relevant document images for a given query using the pre-trained visual retriever, then pass retrieved pages to a vision-language reader model to generate the final answer
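A minimal sketch of Step 1 follows, assuming pdf2image (backed by poppler) for PDF rendering and a headless LibreOffice install for converting PPTX to PDF first; the paper does not prescribe a rendering toolchain, so these tools, the DPI, and the file naming are assumptions.

```python
# Step 1 sketch: render every document page to an image.
# Assumed toolchain: LibreOffice for PPTX -> PDF, pdf2image/poppler for PDF -> PNG.
import subprocess
from pathlib import Path
from pdf2image import convert_from_path  # requires poppler to be installed

def to_pdf(doc_path: Path, out_dir: Path) -> Path:
    """Convert PPTX (or other office formats) to PDF; pass PDFs through unchanged."""
    if doc_path.suffix.lower() == ".pdf":
        return doc_path
    subprocess.run(
        ["libreoffice", "--headless", "--convert-to", "pdf",
         "--outdir", str(out_dir), str(doc_path)],
        check=True,
    )
    return out_dir / (doc_path.stem + ".pdf")

def render_pages(doc_path: Path, out_dir: Path, dpi: int = 144) -> list[Path]:
    """Render each page of a document as a PNG and return the image paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    pdf_path = to_pdf(doc_path, out_dir)
    pages = convert_from_path(str(pdf_path), dpi=dpi)
    paths = []
    for i, page in enumerate(pages, start=1):
        path = out_dir / f"{doc_path.stem}_p{i}.png"
        page.save(path)
        paths.append(path)
    return paths

# Example: render_pages(Path("earnings_deck.pptx"), Path("pages/"))
```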
System Components
- VDocRAG pipeline: end-to-end RAG system that operates over document images rather than extracted text, unifying retrieval and generation across mixed modalities and formats (see the end-to-end sketch after this list)
- Visual document retriever: a VLM adapted via self-supervised pre-training to encode document page images into dense vector representations for similarity-based retrieval
- Pre-training objectives: novel training objectives that align visual document representations with textual content within pages, bridging the visual-textual gap for retrieval
- OpenDocVQA benchmark: a unified collection for open-domain document visual question answering, aggregating diverse document types and formats for standardized training and evaluation
- Visual document generator (reader): a generative VLM that takes retrieved document page images and a query and produces the final natural-language answer
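Wiring these components together, the sketch below shows the retrieve-then-read loop: rank page images against the question, then hand the top-k pages to a reader. It reuses the hypothetical embed_query, embed_pages, and render_pages helpers from the earlier sketches, and leaves the reader as an explicit stub (answer_with_vlm) because the choice of generative VLM and its prompting API are not fixed here.

```python
# End-to-end retrieve-then-read wiring. Page rendering/embedding reuses the
# helpers from the earlier sketches; answer_with_vlm is a hypothetical stub
# for whichever vision-language reader model you plug in.
from pathlib import Path
import torch

def answer_with_vlm(question: str, page_images: list[Path]) -> str:
    """Hypothetical reader: replace with a call to your generative VLM of choice."""
    raise NotImplementedError("wire up a vision-language reader model")

def retrieve(question: str, page_paths: list[Path], page_vecs: torch.Tensor, k: int = 3):
    """Return the k page images whose embeddings best match the question."""
    query_vec = embed_query(question)                  # from the retrieval sketch
    scores = (query_vec @ page_vecs.T).squeeze(0)      # cosine scores over all pages
    top = torch.topk(scores, k=min(k, len(page_paths))).indices
    return [page_paths[i] for i in top.tolist()]

def answer(question: str, page_paths: list[Path], page_vecs: torch.Tensor, k: int = 3) -> str:
    """Retrieve the top-k evidence pages and ask the reader for the answer."""
    evidence = retrieve(question, page_paths, page_vecs, k=k)
    return answer_with_vlm(question, evidence)

# Usage (with render_pages/embed_pages from the earlier sketches):
# page_paths = render_pages(Path("earnings_deck.pptx"), Path("pages/"))
# page_vecs = embed_pages(page_paths)
# print(answer("Which region had the highest sales in the pie chart?", page_paths, page_vecs))
```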
Results
| Setting | Text-based RAG Baseline | VDocRAG | Takeaway |
|---|---|---|---|
| Open-domain DocVQA (charts/tables) | Lower accuracy; information lost during text parsing | Substantially higher accuracy | Significant improvement |
| Open-domain DocVQA (mixed formats) | Degrades on content that cannot be extracted as text | Robust across PDF and PPTX | Strong generalization gain |
| Retrieval on visually-rich documents | Misses visual layout and content cues | Dense visual retrieval preserves visual and textual cues | Qualitatively superior |
Key Takeaways
- For practitioners building enterprise document QA systems, treating document pages as images rather than parsed text is a superior strategy for preserving information in charts, tables, and complex layouts
- Self-supervised pre-training that aligns visual page representations with in-document text is a practical and scalable approach to adapting off-the-shelf VLMs for retrieval without requiring expensive labeled retrieval data (a minimal sketch of such an alignment objective follows this list)
- OpenDocVQA provides ML teams with a ready-made, diverse benchmark for evaluating and comparing multimodal RAG systems on real-world document types, enabling more rigorous and reproducible research
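As a rough illustration of the second takeaway, the sketch below implements a symmetric contrastive (InfoNCE-style) loss that pulls a page-image embedding toward an embedding of text taken from the same page and pushes it away from other pages in the batch. This is a generic alignment objective under assumed encoder interfaces, not the specific pre-training tasks proposed in the paper.

```python
# Generic image-text alignment objective (InfoNCE-style), assuming you already
# have an image encoder and a text encoder producing [batch, dim] embeddings.
# Illustrates the alignment idea only; it is not the paper's exact objective.
import torch
import torch.nn.functional as F

def alignment_loss(page_image_emb: torch.Tensor,
                   page_text_emb: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss: matching (image, text) pairs share a row index."""
    img = F.normalize(page_image_emb, dim=-1)
    txt = F.normalize(page_text_emb, dim=-1)
    logits = img @ txt.T / temperature              # [batch, batch] similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)     # image -> matching in-page text
    loss_t2i = F.cross_entropy(logits.T, targets)   # text -> matching page image
    return 0.5 * (loss_i2t + loss_t2i)

# Example with random embeddings standing in for encoder outputs:
# loss = alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
```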
Abstract
We aim to develop a retrieval-augmented generation (RAG) framework that answers questions over a corpus of visually-rich documents presented in mixed modalities (e.g., charts, tables) and diverse formats (e.g., PDF, PPTX). In this paper, we introduce a new RAG framework, VDocRAG, which can directly understand varied documents and modalities in a unified image format to prevent missing information that occurs by parsing documents to obtain text. To improve the performance, we propose novel self-supervised pre-training tasks that adapt large vision-language models for retrieval by compressing visual information into dense token representations while aligning them with textual content in documents. Furthermore, we introduce OpenDocVQA, the first unified collection of open-domain document visual question answering datasets, encompassing diverse document types and formats. OpenDocVQA provides a comprehensive resource for training and evaluating retrieval and question answering models on visually-rich documents in an open-domain setting. Experiments show that VDocRAG substantially outperforms conventional text-based RAG and has strong generalization capability, highlighting the potential of an effective RAG paradigm for real-world documents.