VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents
Problem Statement
Conventional text-based RAG systems fail on visually-rich documents because parsing PDFs/PPTXs to text loses critical information embedded in charts, tables, and layouts. Existing document QA benchmarks are fragmented across closed-domain settings, lacking a unified resource for open-domain evaluation. There is no established framework for adapting large vision-language models to perform dense retrieval over heterogeneous document corpora.
Key Novelty
- VDocRAG framework: end-to-end RAG pipeline that ingests documents as unified image representations, preserving visual modalities without lossy text conversion
- Novel self-supervised pre-training tasks that compress visual document information into dense token embeddings aligned with textual content, adapting VLMs for retrieval (a minimal retrieval sketch follows this list)
- OpenDocVQA: the first unified open-domain document visual question answering benchmark collection covering diverse document types, formats, and modalities for training and evaluation
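To make the dense-retrieval idea above concrete, here is a minimal sketch of scoring a question against document page images with dense embeddings. It uses an off-the-shelf CLIP checkpoint from Hugging Face transformers purely as a stand-in encoder, and the pooling, normalization, and file names are illustrative assumptions; VDocRAG adapts a large VLM with its own pre-training tasks, which this sketch does not reproduce.

```python
# Minimal dense-retrieval sketch over document page images.
# CLIP is only a stand-in encoder; VDocRAG adapts a large VLM with its own
# pre-training tasks, which this sketch does not reproduce.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_pages(image_paths):
    """Encode rendered page images into L2-normalized dense vectors."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def embed_query(question):
    """Encode the question into an L2-normalized dense vector."""
    inputs = processor(text=[question], return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

# Hypothetical page images rendered from a report and a slide deck.
page_vecs = embed_pages(["report_p1.png", "slides_p3.png"])
query_vec = embed_query("What was Q3 revenue according to the bar chart?")
scores = query_vec @ page_vecs.T          # cosine similarity (vectors are normalized)
top_page = scores.argmax(dim=-1).item()   # index of the best-matching page image
```

Swapping the stand-in encoder for a retrieval-adapted VLM changes the embedding quality, not the scoring logic: retrieval stays a nearest-neighbor search over page-image vectors.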
Evaluation Highlights
- VDocRAG substantially outperforms conventional text-based RAG baselines across diverse document types including those with charts, tables, and mixed-modality content
- Strong generalization capability demonstrated across varied document formats (PDF, PPTX) and question types in the OpenDocVQA benchmark
Breakthrough Assessment
Methodology
- Step 1 — Document Ingestion: Convert all documents (PDF, PPTX, etc.) into a unified image format to preserve visual structure, layout, charts, and tables without lossy text parsing (a minimal rendering sketch follows this list)
- Step 2 — Visual Retriever Pre-training: Apply novel self-supervised tasks on large VLMs to compress page-level visual information into dense token representations aligned with in-document textual content, enabling effective similarity-based retrieval
- Step 3 — RAG Pipeline Execution: At inference, retrieve the most relevant document images for a given query using the pre-trained visual retriever, then pass retrieved pages to a vision-language reader model to generate the final answer
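A minimal sketch of Step 1 follows, assuming pdf2image (backed by poppler) for PDF rendering and a headless LibreOffice install for converting PPTX to PDF first; the paper does not prescribe a rendering toolchain, so these tools, the DPI, and the file naming are assumptions.

```python
# Step 1 sketch: render every document page to an image.
# Assumed toolchain: LibreOffice for PPTX -> PDF, pdf2image/poppler for PDF -> PNG.
import subprocess
from pathlib import Path
from pdf2image import convert_from_path  # requires poppler to be installed

def to_pdf(doc_path: Path, out_dir: Path) -> Path:
    """Convert PPTX (or other office formats) to PDF; pass PDFs through unchanged."""
    if doc_path.suffix.lower() == ".pdf":
        return doc_path
    subprocess.run(
        ["libreoffice", "--headless", "--convert-to", "pdf",
         "--outdir", str(out_dir), str(doc_path)],
        check=True,
    )
    return out_dir / (doc_path.stem + ".pdf")

def render_pages(doc_path: Path, out_dir: Path, dpi: int = 144) -> list[Path]:
    """Render each page of a document as a PNG and return the image paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    pdf_path = to_pdf(doc_path, out_dir)
    pages = convert_from_path(str(pdf_path), dpi=dpi)
    paths = []
    for i, page in enumerate(pages, start=1):
        path = out_dir / f"{doc_path.stem}_p{i}.png"
        page.save(path)
        paths.append(path)
    return paths

# Example: render_pages(Path("earnings_deck.pptx"), Path("pages/"))
```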
System Components
- VDocRAG pipeline: end-to-end RAG system that operates over document images rather than extracted text, unifying retrieval and generation across mixed modalities and formats (see the end-to-end sketch after this list)
- Visual document retriever: a VLM adapted via self-supervised pre-training to encode document page images into dense vector representations for similarity-based retrieval
- Pre-training objectives: novel training objectives that align visual document representations with textual content within pages, bridging the visual-textual gap for retrieval
- OpenDocVQA benchmark: a unified collection for open-domain document visual question answering, aggregating diverse document types and formats for standardized training and evaluation
- Visual document generator (reader): a generative VLM that takes retrieved document page images and a query and produces the final natural-language answer
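Wiring these components together, the sketch below shows the retrieve-then-read loop: rank page images against the question, then hand the top-k pages to a reader. It reuses the hypothetical embed_query, embed_pages, and render_pages helpers from the earlier sketches, and leaves the reader as an explicit stub (answer_with_vlm) because the choice of generative VLM and its prompting API are not fixed here.

```python
# End-to-end retrieve-then-read wiring. Page rendering/embedding reuses the
# helpers from the earlier sketches; answer_with_vlm is a hypothetical stub
# for whichever vision-language reader model you plug in.
from pathlib import Path
import torch

def answer_with_vlm(question: str, page_images: list[Path]) -> str:
    """Hypothetical reader: replace with a call to your generative VLM of choice."""
    raise NotImplementedError("wire up a vision-language reader model")

def retrieve(question: str, page_paths: list[Path], page_vecs: torch.Tensor, k: int = 3):
    """Return the k page images whose embeddings best match the question."""
    query_vec = embed_query(question)                  # from the retrieval sketch
    scores = (query_vec @ page_vecs.T).squeeze(0)      # cosine scores over all pages
    top = torch.topk(scores, k=min(k, len(page_paths))).indices
    return [page_paths[i] for i in top.tolist()]

def answer(question: str, page_paths: list[Path], page_vecs: torch.Tensor, k: int = 3) -> str:
    """Retrieve the top-k evidence pages and ask the reader for the answer."""
    evidence = retrieve(question, page_paths, page_vecs, k=k)
    return answer_with_vlm(question, evidence)

# Usage (with render_pages/embed_pages from the earlier sketches):
# page_paths = render_pages(Path("earnings_deck.pptx"), Path("pages/"))
# page_vecs = embed_pages(page_paths)
# print(answer("Which region had the highest sales in the pie chart?", page_paths, page_vecs))
```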
Results
| Setting | Text-based RAG Baseline | VDocRAG | Takeaway |
|---|---|---|---|
| Open-domain DocVQA (charts/tables) | Lower accuracy; information lost during text parsing | Substantially higher accuracy | Significant improvement |
| Open-domain DocVQA (mixed formats) | Degrades on content that cannot be extracted as text | Robust across PDF and PPTX | Strong generalization gain |
| Retrieval on visually-rich documents | Misses visual layout and content cues | Dense visual retrieval preserves visual and textual cues | Qualitatively superior |
Key Takeaways
- For practitioners building enterprise document QA systems, treating document pages as images rather than parsed text is a superior strategy for preserving information in charts, tables, and complex layouts
- Self-supervised pre-training that aligns visual page representations with in-document text is a practical and scalable approach to adapting off-the-shelf VLMs for retrieval without requiring expensive labeled retrieval data (a minimal sketch of such an alignment objective follows this list)
- OpenDocVQA provides ML teams with a ready-made, diverse benchmark for evaluating and comparing multimodal RAG systems on real-world document types, enabling more rigorous and reproducible research
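As a rough illustration of the second takeaway, the sketch below implements a symmetric contrastive (InfoNCE-style) loss that pulls a page-image embedding toward an embedding of text taken from the same page and pushes it away from other pages in the batch. This is a generic alignment objective under assumed encoder interfaces, not the specific pre-training tasks proposed in the paper.

```python
# Generic image-text alignment objective (InfoNCE-style), assuming you already
# have an image encoder and a text encoder producing [batch, dim] embeddings.
# Illustrates the alignment idea only; it is not the paper's exact objective.
import torch
import torch.nn.functional as F

def alignment_loss(page_image_emb: torch.Tensor,
                   page_text_emb: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss: matching (image, text) pairs share a row index."""
    img = F.normalize(page_image_emb, dim=-1)
    txt = F.normalize(page_text_emb, dim=-1)
    logits = img @ txt.T / temperature              # [batch, batch] similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)     # image -> matching in-page text
    loss_t2i = F.cross_entropy(logits.T, targets)   # text -> matching page image
    return 0.5 * (loss_i2t + loss_t2i)

# Example with random embeddings standing in for encoder outputs:
# loss = alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
```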
Abstract
We aim to develop a retrieval-augmented generation (RAG) framework that answers questions over a corpus of visually-rich documents presented in mixed modalities (e.g., charts, tables) and diverse formats (e.g., PDF, PPTX). In this paper, we introduce a new RAG framework, VDocRAG, which can directly understand varied documents and modalities in a unified image format to prevent missing information that occurs by parsing documents to obtain text. To improve the performance, we propose novel self-supervised pre-training tasks that adapt large vision-language models for retrieval by compressing visual information into dense token representations while aligning them with textual content in documents. Furthermore, we introduce OpenDocVQA, the first unified collection of open-domain document visual question answering datasets, encompassing diverse document types and formats. OpenDocVQA provides a comprehensive resource for training and evaluating retrieval and question answering models on visually-rich documents in an open-domain setting. Experiments show that VDocRAG substantially outperforms conventional text-based RAG and has strong generalization capability, highlighting the potential of an effective RAG paradigm for real-world documents.