Nemotron ColEmbed V2: Top-Performing Late-Interaction Embedding Models for Visual Document Retrieval
Problem Statement
Enterprise RAG pipelines need to retrieve relevant content from large catalogs of visual documents (PDFs, slides) without expensive OCR preprocessing. Existing dense retrieval models lose visual information when converting documents to text, while early VLM-based embedding models underperform or are computationally impractical. This work addresses the accuracy-storage-compute tradeoffs that make late interaction models difficult to deploy at scale.
Key Novelty
- Three VLM-based late interaction embedding model variants (3B, 4B, 8B) built on NVIDIA Eagle 2 and Qwen3-VL backbones, ranking #1 on the ViDoRe V3 leaderboard with an NDCG@10 of 63.42
- Comprehensive training recipe combining cluster-based hard-negative mining, bidirectional attention modification, and model merging for visual document embeddings
- Practical engineering analysis of compute and storage challenges in late interaction mechanisms with experiments on lower-dimension embeddings to balance accuracy and storage costs
Evaluation Highlights
- 8B model achieves NDCG@10 of 63.42 on ViDoRe V3 benchmark, ranking #1 on the leaderboard as of February 2026
- Three model variants (3B, 4B, 8B) provide a scalable family covering different compute budgets while maintaining top-tier retrieval performance
Methodology
- Data processing: curate and sample visual document datasets using cluster-based sampling to ensure diversity, and mine hard negatives to create challenging training pairs that improve discriminative embedding quality
- Model training: fine-tune pre-trained VLMs (Eagle 2 3B, Qwen3-VL 4B/8B) with bidirectional attention (replacing causal attention) and a late interaction objective (ColBERT-style token-level similarity) to produce rich multi-vector document and query representations
- Post-training and optimization: apply model merging to combine complementary model checkpoints, and experiment with embedding dimension reduction to trade off retrieval accuracy against storage and compute costs for the late interaction index
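The ColBERT-style late interaction objective above scores a query-document pair by taking, for each query token embedding, its maximum similarity over all document token embeddings, then summing over query tokens (MaxSim). A minimal NumPy sketch of that scoring function (toy shapes and random data, not the paper's implementation):

```python
import numpy as np

def maxsim_score(query_embs: np.ndarray, doc_embs: np.ndarray) -> float:
    """ColBERT-style MaxSim: for each query token embedding, take its
    maximum cosine similarity over all document token embeddings, then
    sum over query tokens."""
    # L2-normalize rows so dot products are cosine similarities.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sim = q @ d.T                        # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())  # max over doc tokens, sum over query tokens

# Toy example: 2 query tokens, 3 document tokens, embedding dim 4.
rng = np.random.default_rng(0)
query = rng.normal(size=(2, 4))
doc = rng.normal(size=(3, 4))
score = maxsim_score(query, doc)
```

Because every query token picks its own best-matching document token, this matching is finer-grained than a single dot product between pooled vectors, which is the core advantage over single-vector dense retrieval.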
System Components
- Late interaction retrieval: Represents queries and documents as multi-vector token-level embeddings and computes MaxSim-based similarity at retrieval time, enabling finer-grained matching than single-vector dense retrieval
- Bidirectional attention: Replaces the causal (autoregressive) attention mask in VLM backbones with full bidirectional attention, allowing each token to attend to all others and produce richer document representations
- Cluster-based sampling and hard-negative mining: Samples training data from document clusters and mines hard negatives (documents that are similar but not relevant) to force the model to learn fine-grained discriminative embeddings
- Model merging: Combines weights from multiple trained checkpoints to improve generalization and robustness across diverse document types and query styles
- Embedding dimension reduction: Experiments with lower-dimensional embedding projections to reduce the storage footprint of the late interaction multi-vector index while minimizing accuracy loss
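Hard-negative mining, as described above, selects documents that look similar to the query but are not labeled relevant. A simplified similarity-based sketch of that idea (the paper's pipeline is cluster-based; the function name and toy data here are illustrative):

```python
import numpy as np

def mine_hard_negatives(query_emb, corpus_embs, positive_ids, k=2):
    """Pick the k corpus items most similar to the query that are NOT
    labeled relevant -- 'hard' negatives that force the model to learn
    fine-grained distinctions rather than easy topical ones."""
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = c @ q                      # cosine similarity of each doc to the query
    order = np.argsort(-sims)        # most similar first
    negatives = [int(i) for i in order if int(i) not in positive_ids]
    return negatives[:k]

# Toy corpus of 5 documents (dim 3); document 0 is the labeled positive.
rng = np.random.default_rng(1)
corpus = rng.normal(size=(5, 3))
query = corpus[0] + 0.05 * rng.normal(size=3)   # query close to the positive
hard_negs = mine_hard_negatives(query, corpus, positive_ids={0}, k=2)
```

In a real pipeline the candidates would come from the same cluster as the positive, so the negatives share topic and layout with the relevant page while differing in the details that matter.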
Results
| Aspect | Prior state of the art | Nemotron ColEmbed V2 (8B) | Delta |
|---|---|---|---|
| ViDoRe V3 NDCG@10 | Below 63.42 (previous leaderboard leader) | 63.42 | Ranked #1 |
| Model scale coverage | Typically single model size | 3B / 4B / 8B variants | Full efficiency-accuracy tradeoff range |
| Storage efficiency | Full-dim late interaction | Reduced-dim variants available | Configurable accuracy-storage tradeoff |
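The storage row above is worth making concrete: a multi-vector index stores one embedding per token rather than one per document, so its footprint scales with tokens-per-page. A back-of-envelope calculation with illustrative numbers (1M pages, 768 tokens per page, fp16; none of these figures are from the paper):

```python
def index_bytes(num_docs, tokens_per_doc, dim, bytes_per_value=2):
    """Raw (uncompressed) size of an embedding index in bytes,
    assuming fp16 storage (2 bytes per value)."""
    return num_docs * tokens_per_doc * dim * bytes_per_value

full = index_bytes(1_000_000, 768, 128)    # full-dimension late interaction
reduced = index_bytes(1_000_000, 768, 64)  # halved embedding dimension
dense = index_bytes(1_000_000, 1, 1024)    # single-vector dense baseline

print(f"late interaction, dim 128: {full / 1e9:.1f} GB")     # ~196.6 GB
print(f"late interaction, dim  64: {reduced / 1e9:.1f} GB")  # ~98.3 GB
print(f"single-vector,   dim 1024: {dense / 1e9:.1f} GB")    # ~2.0 GB
```

The roughly two-orders-of-magnitude gap between the multi-vector and single-vector rows is why the paper's dimension-reduction experiments matter for deployment.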
Key Takeaways
- For enterprise visual document RAG, VLM-based late interaction models can eliminate OCR preprocessing while achieving superior retrieval accuracy — the 8B variant is a strong drop-in retrieval backbone for PDF/slide corpora
- Bidirectional attention modification and model merging are practical post-hoc techniques that meaningfully improve embedding quality over naive VLM fine-tuning and are worth incorporating into any VLM-based embedding training pipeline
- Late interaction models impose significant storage and compute overhead at index time; practitioners should profile embedding dimension reduction experiments early to find the accuracy-storage Pareto frontier suitable for their deployment constraints
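The model-merging step mentioned in the takeaways can, in its simplest form, be sketched as parameter averaging across checkpoints (a "model soup"). The paper does not specify its exact merging recipe; this is a minimal uniform-averaging sketch on toy NumPy state dicts:

```python
import numpy as np

def merge_checkpoints(state_dicts, weights=None):
    """Weighted parameter averaging across checkpoints -- one simple
    form of model merging. Defaults to a uniform average."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name] for w, sd in zip(weights, state_dicts))
    return merged

# Toy checkpoints with a single 2x2 parameter tensor each.
ckpt_a = {"proj.weight": np.ones((2, 2))}
ckpt_b = {"proj.weight": 3 * np.ones((2, 2))}
merged = merge_checkpoints([ckpt_a, ckpt_b])
```

Averaging checkpoints trained with different data mixes or seeds often improves robustness over any single checkpoint, which matches the generalization motivation given in the System Components section.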
Abstract
Retrieval-Augmented Generation (RAG) systems have become popular for generative applications, grounding language models in external knowledge. Companies seek to leverage their large catalogs of documents (e.g. PDFs, presentation slides) in such RAG pipelines, whose first step is the retrieval component. Dense retrieval has been a popular approach, where embedding models map the user query to a dense representation that lies close to the embeddings of relevant content. More recently, VLM-based embedding models have become popular for visual document retrieval, as they preserve visual information and simplify the indexing pipeline compared to OCR text extraction. Motivated by the growing demand for visual document retrieval, we introduce Nemotron ColEmbed V2, a family of models that achieve state-of-the-art performance on the ViDoRe benchmarks. We release three variants - with 3B, 4B, and 8B parameters - based on pre-trained VLMs: NVIDIA Eagle 2 with a Llama 3.2 3B backbone, Qwen3-VL-4B-Instruct, and Qwen3-VL-8B-Instruct, respectively. The 8B model ranks first on the ViDoRe V3 leaderboard as of February 03, 2026, achieving an average NDCG@10 of 63.42. We describe the main techniques used across data processing, training, and post-training - such as cluster-based sampling, hard-negative mining, bidirectional attention, late interaction, and model merging - that helped us build our top-performing models. We also discuss the compute and storage engineering challenges posed by the late interaction mechanism and present experiments on balancing accuracy and storage with lower-dimension embeddings.