
Vision-aligned Latent Reasoning for Multi-modal Large Language Model

Byungwoo Jeon, Yoonwoo Jeong, Hyunseok Lee, Minsu Cho, Jinwoo Shin
2026
Vision-aligned Latent Reasoning (VaLR) addresses visual information dilution in multi-modal LLMs during long-context reasoning by dynamically generating vision-aligned latent tokens before each Chain-of-Thought step, anchoring reasoning in perceptual cues from the latent space.

Problem Statement

Current MLLMs suffer from progressive dilution of visual information during long-context generation, causing them to lose perceptual grounding as reasoning chains grow longer. This bottleneck prevents effective test-time scaling for visually-intensive multi-step reasoning tasks. Existing approaches lack a mechanism to continuously re-anchor reasoning steps to visual features, limiting performance on benchmarks requiring precise spatial or perceptual understanding.

Key Novelty

  • Introduces vision-aligned latent tokens dynamically inserted before each CoT reasoning step to preserve visual grounding throughout long reasoning chains
  • Proposes an alignment training objective that matches intermediate MLLM embeddings with vision encoder embeddings, explicitly preventing visual knowledge decay
  • First demonstration of test-time scaling behavior in MLLMs for visually-grounded multi-step reasoning, a capability absent in prior multi-modal models

Evaluation Highlights

  • Achieves 52.9% on VSI-Bench vs. 33.0% for Qwen2.5-VL baseline, a 19.9 percentage point gain on spatial visual intelligence
  • Consistently outperforms existing approaches across multiple long-context understanding and precise visual perception benchmarks while exhibiting test-time scaling

Breakthrough Assessment

7/10. VaLR addresses a fundamental and underexplored bottleneck in MLLMs (visual dilution during reasoning) with a principled yet simple mechanism, and uniquely enables test-time scaling for multi-modal reasoning, a capability that could significantly expand the practical utility of MLLMs on hard visual tasks.

Methodology

  1. Identify that visual information progressively dilutes in MLLM hidden states during long CoT generation, limiting multi-step visual reasoning quality
  2. Train the model to generate special vision-aligned latent tokens at each reasoning step by aligning the model's intermediate embeddings with corresponding vision encoder outputs via a distillation/alignment loss
  3. At inference, these latent tokens are auto-regressively generated before each CoT step, dynamically re-injecting visual grounding and enabling test-time scaling through extended reasoning
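The inference loop in step 3 can be sketched as a toy interleaving of latent-token generation and CoT text generation. Everything below is illustrative: `generate_latent_tokens` and `generate_cot_step` are hypothetical stand-ins for the model's decoding calls, not the paper's actual API.

```python
# Toy sketch of VaLR-style inference: before each Chain-of-Thought step,
# vision-aligned latent tokens are generated and appended to the context,
# re-injecting visual grounding. All functions here are illustrative stubs.

def generate_latent_tokens(context, vision_features, k=4):
    """Stub: emit k latent tokens derived from the vision features."""
    return [f"<latent:{vision_features[i % len(vision_features)]}>" for i in range(k)]

def generate_cot_step(context, step_idx):
    """Stub: emit one reasoning step conditioned on the running context."""
    return f"Step {step_idx}: reason over {len(context)} context tokens."

def valr_inference(prompt_tokens, vision_features, num_steps=3):
    context = list(prompt_tokens)
    steps = []
    for i in range(1, num_steps + 1):
        # Re-anchor: latent tokens come *before* each reasoning step.
        context += generate_latent_tokens(context, vision_features)
        step = generate_cot_step(context, i)
        steps.append(step)
        context.append(step)
    return steps

if __name__ == "__main__":
    for s in valr_inference(["<img>", "Q: where is the cup?"], ["f0", "f1"], num_steps=2):
        print(s)
```

The point of the structure is that the context grows with fresh, vision-derived tokens at every round, so later reasoning steps never depend solely on the initial visual encoding.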

System Components

Vision-aligned Latent Tokens

Special tokens generated dynamically before each reasoning step that encode perceptually grounded information derived from alignment with vision encoder representations

Intermediate Embedding Alignment

Training objective that minimizes discrepancy between MLLM intermediate hidden states and vision encoder embeddings, preventing visual knowledge decay over long contexts
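A minimal sketch of one plausible form of this objective: mean-squared error between the MLLM's intermediate hidden states and the target vision-encoder embeddings at the latent-token positions. The shapes and the choice of MSE (rather than, say, a cosine or contrastive loss) are assumptions; the paper's exact formulation may differ.

```python
# Minimal sketch of an embedding-alignment objective: mean squared error
# between MLLM intermediate hidden states and target vision-encoder
# embeddings, averaged over all vector components.

def mse_alignment_loss(hidden_states, vision_embeddings):
    """Average squared distance between paired d-dimensional vectors.

    hidden_states, vision_embeddings: lists of equal-length float lists.
    """
    assert len(hidden_states) == len(vision_embeddings)
    total, count = 0.0, 0
    for h, v in zip(hidden_states, vision_embeddings):
        assert len(h) == len(v)
        total += sum((hi - vi) ** 2 for hi, vi in zip(h, v))
        count += len(h)
    return total / count

if __name__ == "__main__":
    h = [[0.0, 1.0], [1.0, 1.0]]  # toy hidden states
    v = [[0.0, 0.0], [1.0, 0.0]]  # toy vision-encoder targets
    print(mse_alignment_loss(h, v))  # 0.5
```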

Chain-of-Thought Integration

Framework that interleaves vision-aligned latent token generation with standard CoT reasoning steps, allowing the model to re-ground each reasoning step in visual features

Test-Time Scaling Mechanism

The latent token generation enables scaling inference compute, as more reasoning steps with fresh visual grounding consistently improve accuracy — a novel capability for MLLMs

Results

| Benchmark | Baseline (Qwen2.5-VL) | VaLR | Delta |
| --- | --- | --- | --- |
| VSI-Bench (Spatial Visual Intelligence) | 33.0% | 52.9% | +19.9pp |
| Long-context understanding benchmarks | Prior SOTA | Consistent improvement | Positive across benchmarks |
| Test-time scaling behavior | Not observed | Observed | New capability unlocked |

Key Takeaways

  • Practitioners building MLLMs for tasks requiring multi-step visual reasoning (e.g., spatial QA, diagram understanding, embodied AI) should consider periodically re-injecting visual grounding tokens rather than relying solely on the initial visual encoding
  • The alignment loss between MLLM intermediate states and vision encoder outputs is a simple add-on that can be incorporated into existing MLLM training pipelines to reduce visual dilution without architectural overhaul
  • VaLR's test-time scaling property suggests that inference budget allocation strategies (common in text LLMs) can now be applied to visually-grounded reasoning, opening new directions for compute-optimal multi-modal inference
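The second takeaway, folding the alignment term into an existing pipeline, can be sketched as a weighted sum with the standard next-token loss. The weight `lambda_align` and the toy single-token cross-entropy are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of adding an alignment term to an existing training step:
# total loss = next-token cross-entropy + lambda_align * alignment penalty.
import math

def cross_entropy(prob_of_target):
    """Toy next-token cross-entropy for a single target probability."""
    return -math.log(prob_of_target)

def alignment_term(hidden, vision):
    """Toy alignment penalty: squared distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(hidden, vision))

def training_loss(prob_of_target, hidden, vision, lambda_align=0.1):
    return cross_entropy(prob_of_target) + lambda_align * alignment_term(hidden, vision)

if __name__ == "__main__":
    # ce = -ln(0.5) ~= 0.693; alignment = 1.0; total ~= 0.793
    print(round(training_loss(0.5, [1.0, 0.0], [0.0, 0.0], lambda_align=0.1), 3))
```

Because the alignment term is just an extra additive loss on intermediate activations, it can be attached to an existing training loop without changing the model architecture, which is what makes it a low-cost add-on.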

Abstract

Despite recent advancements in Multi-modal Large Language Models (MLLMs) on diverse understanding tasks, these models struggle to solve problems which require extensive multi-step reasoning. This is primarily due to the progressive dilution of visual information during long-context generation, which hinders their ability to fully exploit test-time scaling. To address this issue, we introduce Vision-aligned Latent Reasoning (VaLR), a simple, yet effective reasoning framework that dynamically generates vision-aligned latent tokens before each Chain of Thought reasoning step, guiding the model to reason based on perceptual cues in the latent space. Specifically, VaLR is trained to preserve visual knowledge during reasoning by aligning intermediate embeddings of MLLM with those from vision encoders. Empirical results demonstrate that VaLR consistently outperforms existing approaches across a wide range of benchmarks requiring long-context understanding or precise visual perception, while exhibiting test-time scaling behavior not observed in prior MLLMs. In particular, VaLR improves the performance significantly from 33.0% to 52.9% on VSI-Bench, achieving a 19.9%p gain over Qwen2.5-VL.

Generated on 2026-03-02 using Claude