Nemotron ColEmbed V2: Top-Performing Late-Interaction Embedding Models for Visual Document Retrieval
Problem Statement
Enterprise RAG pipelines need to retrieve relevant content from large catalogs of visual documents (PDFs, slides) without expensive OCR preprocessing. Existing dense retrieval models lose visual information when converting documents to text, while early VLM-based embedding models underperform or are computationally impractical. This work addresses the accuracy-storage-compute tradeoffs that make late interaction models difficult to deploy at scale.
Key Novelty
- Three VLM-based late interaction embedding model variants (3B, 4B, 8B) built on NVIDIA Eagle 2 and Qwen3-VL backbones, ranking #1 on the ViDoRe V3 leaderboard with an NDCG@10 of 63.42
- Comprehensive training recipe combining cluster-based hard-negative mining, bidirectional attention modification, and model merging for visual document embeddings
- Practical engineering analysis of compute and storage challenges in late interaction mechanisms with experiments on lower-dimension embeddings to balance accuracy and storage costs
Evaluation Highlights
- 8B model achieves NDCG@10 of 63.42 on ViDoRe V3 benchmark, ranking #1 on the leaderboard as of February 2026
- Three model variants (3B, 4B, 8B) provide a scalable family covering different compute budgets while maintaining top-tier retrieval performance
Methodology
- Data processing: curate and sample visual document datasets using cluster-based sampling to ensure diversity, and mine hard negatives to create challenging training pairs that improve discriminative embedding quality
- Model training: fine-tune pre-trained VLMs (Eagle 2 3B, Qwen3-VL 4B/8B) with bidirectional attention (replacing causal attention) and a late interaction objective (ColBERT-style token-level similarity) to produce rich multi-vector document and query representations
- Post-training and optimization: apply model merging to combine complementary model checkpoints, and experiment with embedding dimension reduction to trade off retrieval accuracy against storage and compute costs for the late interaction index
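The ColBERT-style late interaction objective above scores a query-document pair by taking, for each query token embedding, its maximum similarity over all document token embeddings, then summing over query tokens (MaxSim). A minimal NumPy sketch of that scoring function (toy shapes and random data, not the paper's implementation):

```python
import numpy as np

def maxsim_score(query_embs: np.ndarray, doc_embs: np.ndarray) -> float:
    """ColBERT-style MaxSim: for each query token embedding, take its
    maximum cosine similarity over all document token embeddings, then
    sum over query tokens."""
    # L2-normalize rows so dot products are cosine similarities.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sim = q @ d.T                        # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())  # max over doc tokens, sum over query tokens

# Toy example: 2 query tokens, 3 document tokens, embedding dim 4.
rng = np.random.default_rng(0)
query = rng.normal(size=(2, 4))
doc = rng.normal(size=(3, 4))
score = maxsim_score(query, doc)
```

Because every query token picks its own best-matching document token, this matching is finer-grained than a single dot product between pooled vectors, which is the core advantage over single-vector dense retrieval.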
System Components
- Late interaction retrieval: Represents queries and documents as multi-vector token-level embeddings and computes MaxSim-based similarity at retrieval time, enabling finer-grained matching than single-vector dense retrieval
- Bidirectional attention: Replaces the causal (autoregressive) attention mask in VLM backbones with full bidirectional attention, allowing each token to attend to all others and produce richer document representations
- Cluster-based sampling and hard-negative mining: Samples training data from document clusters and mines hard negatives (documents that are similar but not relevant) to force the model to learn fine-grained discriminative embeddings
- Model merging: Combines weights from multiple trained checkpoints to improve generalization and robustness across diverse document types and query styles
- Embedding dimension reduction: Experiments with lower-dimensional embedding projections to reduce the storage footprint of the late interaction multi-vector index while minimizing accuracy loss
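Hard-negative mining, as described above, selects documents that look similar to the query but are not labeled relevant. A simplified similarity-based sketch of that idea (the paper's pipeline is cluster-based; the function name and toy data here are illustrative):

```python
import numpy as np

def mine_hard_negatives(query_emb, corpus_embs, positive_ids, k=2):
    """Pick the k corpus items most similar to the query that are NOT
    labeled relevant -- 'hard' negatives that force the model to learn
    fine-grained distinctions rather than easy topical ones."""
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = c @ q                      # cosine similarity of each doc to the query
    order = np.argsort(-sims)        # most similar first
    negatives = [int(i) for i in order if int(i) not in positive_ids]
    return negatives[:k]

# Toy corpus of 5 documents (dim 3); document 0 is the labeled positive.
rng = np.random.default_rng(1)
corpus = rng.normal(size=(5, 3))
query = corpus[0] + 0.05 * rng.normal(size=3)   # query close to the positive
hard_negs = mine_hard_negatives(query, corpus, positive_ids={0}, k=2)
```

In a real pipeline the candidates would come from the same cluster as the positive, so the negatives share topic and layout with the relevant page while differing in the details that matter.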
Results
| Aspect | Prior state of the art | Nemotron ColEmbed V2 (8B) | Delta |
|---|---|---|---|
| ViDoRe V3 NDCG@10 | Below 63.42 (previous leaderboard leader) | 63.42 | Ranked #1 |
| Model scale coverage | Typically single model size | 3B / 4B / 8B variants | Full efficiency-accuracy tradeoff range |
| Storage efficiency | Full-dim late interaction | Reduced-dim variants available | Configurable accuracy-storage tradeoff |
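The storage row above is worth making concrete: a multi-vector index stores one embedding per token rather than one per document, so its footprint scales with tokens-per-page. A back-of-envelope calculation with illustrative numbers (1M pages, 768 tokens per page, fp16; none of these figures are from the paper):

```python
def index_bytes(num_docs, tokens_per_doc, dim, bytes_per_value=2):
    """Raw (uncompressed) size of an embedding index in bytes,
    assuming fp16 storage (2 bytes per value)."""
    return num_docs * tokens_per_doc * dim * bytes_per_value

full = index_bytes(1_000_000, 768, 128)    # full-dimension late interaction
reduced = index_bytes(1_000_000, 768, 64)  # halved embedding dimension
dense = index_bytes(1_000_000, 1, 1024)    # single-vector dense baseline

print(f"late interaction, dim 128: {full / 1e9:.1f} GB")     # ~196.6 GB
print(f"late interaction, dim  64: {reduced / 1e9:.1f} GB")  # ~98.3 GB
print(f"single-vector,   dim 1024: {dense / 1e9:.1f} GB")    # ~2.0 GB
```

The roughly two-orders-of-magnitude gap between the multi-vector and single-vector rows is why the paper's dimension-reduction experiments matter for deployment.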
Key Takeaways
- For enterprise visual document RAG, VLM-based late interaction models can eliminate OCR preprocessing while achieving superior retrieval accuracy — the 8B variant is a strong drop-in retrieval backbone for PDF/slide corpora
- Bidirectional attention modification and model merging are practical post-hoc techniques that meaningfully improve embedding quality over naive VLM fine-tuning and are worth incorporating into any VLM-based embedding training pipeline
- Late interaction models impose significant storage and compute overhead at index time; practitioners should profile embedding dimension reduction experiments early to find the accuracy-storage Pareto frontier suitable for their deployment constraints
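The model-merging step mentioned in the takeaways can, in its simplest form, be sketched as parameter averaging across checkpoints (a "model soup"). The paper does not specify its exact merging recipe; this is a minimal uniform-averaging sketch on toy NumPy state dicts:

```python
import numpy as np

def merge_checkpoints(state_dicts, weights=None):
    """Weighted parameter averaging across checkpoints -- one simple
    form of model merging. Defaults to a uniform average."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name] for w, sd in zip(weights, state_dicts))
    return merged

# Toy checkpoints with a single 2x2 parameter tensor each.
ckpt_a = {"proj.weight": np.ones((2, 2))}
ckpt_b = {"proj.weight": 3 * np.ones((2, 2))}
merged = merge_checkpoints([ckpt_a, ckpt_b])
```

Averaging checkpoints trained with different data mixes or seeds often improves robustness over any single checkpoint, which matches the generalization motivation given in the System Components section.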
Abstract
Retrieval-Augmented Generation (RAG) systems have become popular for generative applications, grounding language models in external knowledge. Companies seek to leverage their large catalogs of documents (e.g. PDFs, presentation slides) in such RAG pipelines, whose first step is the retrieval component. Dense retrieval has been a popular approach, where embedding models map the user query to a dense representation that lies close to the embeddings of relevant content. More recently, VLM-based embedding models have become popular for visual document retrieval, as they preserve visual information and simplify the indexing pipeline compared to OCR text extraction. Motivated by the growing demand for visual document retrieval, we introduce Nemotron ColEmbed V2, a family of models that achieve state-of-the-art performance on the ViDoRe benchmarks. We release three variants - with 3B, 4B, and 8B parameters - based on pre-trained VLMs: NVIDIA Eagle 2 with a Llama 3.2 3B backbone, Qwen3-VL-4B-Instruct, and Qwen3-VL-8B-Instruct, respectively. The 8B model ranks first on the ViDoRe V3 leaderboard as of February 03, 2026, achieving an average NDCG@10 of 63.42. We describe the main techniques used across data processing, training, and post-training - such as cluster-based sampling, hard-negative mining, bidirectional attention, late interaction, and model merging - that helped us build our top-performing models. We also discuss the compute and storage engineering challenges posed by the late interaction mechanism and present experiments on balancing accuracy and storage with lower-dimension embeddings.