
Test-Time Computing for Referring Multimodal Large Language Models

Mingrui Wu, Hao Chen, Jiayi Ji, Xiaoshuai Sun, Zhiyuan Liu, Liujuan Cao, Ming-Ming Cheng, Rongrong Ji
2026
ControlMLLM++ is a test-time adaptation framework that injects optimizable visual prompt tokens into frozen MLLMs to enable fine-grained region-based visual reasoning without any model retraining or fine-tuning. It leverages cross-modal attention maps as semantic guides to steer model focus toward user-specified visual regions.

Problem Statement

Existing MLLMs struggle with precise region-level visual grounding and referring tasks without expensive fine-tuning or architectural modifications. Training-based approaches require significant compute and risk catastrophic forgetting, while frozen models lack mechanisms to dynamically attend to user-specified spatial regions. There is a need for inference-time methods that can adapt frozen MLLMs to fine-grained spatial reasoning across diverse visual prompt modalities.

Key Novelty

  • Test-time optimization of learnable visual token modifiers via a task-specific energy function, enabling region-based control of frozen MLLMs without any parameter updates to the base model
  • Optim++: An improved optimization strategy that enhances stability during inference-time gradient-based visual prompt optimization
  • PromptDebias: A mechanism to mitigate language prompt biases that can skew attention away from visually-specified regions during test-time adaptation

Evaluation Highlights

  • Strong out-of-domain generalization across diverse visual prompt types (bounding boxes, masks, scribbles, points) without task-specific training
  • Improved interpretability through cross-modal attention maps that visually validate semantic correspondence between text tokens and specified image regions

Breakthrough Assessment

6/10: The paper presents a solid and practically useful contribution by enabling training-free, test-time spatial control over frozen MLLMs, addressing a real deployment constraint. However, gradient-based test-time optimization is an established paradigm, and the novelty is primarily in the careful engineering of the energy function, optimization stability, and debiasing rather than a fundamental algorithmic shift.

Methodology

  1. Step 1 - Cross-modal attention analysis: Extract attention maps between textual tokens and visual patch tokens from the frozen MLLM to identify semantic correspondences and use them as optimization targets
  2. Step 2 - Latent visual token modifier optimization: At inference time, optimize a learnable visual token modifier (not model weights) using a task-specific energy function that encourages cross-modal attention to concentrate on user-specified regions (box, mask, scribble, or point)
  3. Step 3 - Stabilized and debiased inference: Apply Optim++ for numerically stable gradient updates and PromptDebias to counteract language-prior biases, then pass the modified visual tokens through the frozen MLLM for final generation
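The three steps above can be sketched end-to-end on a toy model. Everything here is illustrative: a single text query vector stands in for the frozen MLLM's cross-modal attention head, the energy is simply the negative attention mass inside the target region, and finite differences replace backpropagation so the sketch stays dependency-free. Only the modifier is updated; the "model" (query and patch tokens) stays frozen throughout.

```python
import numpy as np

def attention_over_patches(query, patch_tokens, modifier):
    """Toy cross-modal attention: softmax over query-patch similarities,
    computed on the (frozen) patch tokens plus the learnable modifier."""
    scores = (patch_tokens + modifier) @ query
    scores = scores - scores.max()          # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

def energy(query, patch_tokens, modifier, region_mask):
    """Task-specific energy: negative attention mass inside the target region."""
    attn = attention_over_patches(query, patch_tokens, modifier)
    return -float(attn[region_mask].sum())

def optimize_modifier(query, patch_tokens, region_mask,
                      steps=100, lr=0.5, eps=1e-4):
    """Per-instance gradient descent on the modifier only; the frozen inputs
    are never updated. Finite differences keep the sketch self-contained."""
    modifier = np.zeros_like(patch_tokens)
    for _ in range(steps):
        base = energy(query, patch_tokens, modifier, region_mask)
        grad = np.zeros_like(modifier)
        for idx in np.ndindex(modifier.shape):
            bumped = modifier.copy()
            bumped[idx] += eps
            grad[idx] = (energy(query, patch_tokens, bumped, region_mask) - base) / eps
        modifier -= lr * grad               # descend the energy
    return modifier
```

In a real MLLM the gradient would come from autodiff through the attention layers, and the energy would aggregate attention across heads and layers; the control logic is the same.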

System Components

Latent Visual Token Modifier

A small set of learnable perturbation vectors added to visual token embeddings at test time, optimized per-instance to redirect cross-modal attention toward user-specified spatial regions
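A minimal sketch of how such a modifier might be applied, assuming the visual tokens occupy a contiguous span of the input embedding sequence (the class name and span convention are hypothetical, not the paper's API):

```python
import numpy as np

class VisualTokenModifier:
    """Learnable per-instance perturbation added to the visual token span;
    the surrounding (frozen) text token embeddings are left untouched."""
    def __init__(self, num_visual_tokens, dim, scale=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.delta = scale * rng.standard_normal((num_visual_tokens, dim))

    def apply(self, token_embeddings, visual_span):
        """visual_span = (start, end) indices of the visual tokens
        within the full token sequence."""
        out = token_embeddings.copy()
        start, end = visual_span
        out[start:end] = out[start:end] + self.delta
        return out
```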

Task-Specific Energy Function

An objective that measures alignment between the MLLM's cross-modal attention maps and the user-provided visual prompt region, guiding the optimization of the visual token modifier
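One plausible form of such an energy, assuming the attention map and the region prompt are both defined over the patch grid (the paper's exact objective may differ), is the negative log of the in-region attention mass:

```python
import numpy as np

def region_energy(attn_map, region_mask, eps=1e-8):
    """Lower energy when cross-modal attention concentrates inside the
    user-specified region: -log of the normalized in-region attention mass."""
    attn = attn_map / (attn_map.sum() + eps)
    inside = attn[region_mask.astype(bool)].sum()
    return -np.log(inside + eps)
```

Minimizing this energy with respect to the visual token modifier pulls attention mass into the region; the log makes the gradient strong when little mass is inside and gentle once the region dominates.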

Optim++

An optimization strategy that improves convergence stability and gradient quality during test-time optimization of the visual prompt
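The summary does not spell out Optim++'s update rule. As an illustrative stand-in only, a common recipe for stabilizing inference-time gradient updates combines gradient-norm clipping with momentum:

```python
import numpy as np

def clipped_momentum_step(param, grad, velocity, lr=0.1, beta=0.9, max_norm=1.0):
    """One stabilized update: clip the gradient norm, then apply an
    exponential-moving-average momentum step. (Illustrative stand-in for
    Optim++; the paper's actual update may differ.)"""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)     # bound the step size
    velocity = beta * velocity + (1 - beta) * grad
    return param - lr * velocity, velocity
```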

PromptDebias

A debiasing mechanism that identifies and suppresses language-driven attention biases that could override visual region signals during test-time optimization
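PromptDebias's exact formulation is not given here; one common debiasing pattern, shown purely as an illustrative guess, subtracts an attention baseline obtained from a region-agnostic (content-free) prompt and renormalizes, so that attention explained by the language prior alone is suppressed:

```python
import numpy as np

def debias_attention(attn, baseline_attn, alpha=1.0):
    """Suppress language-driven attention by subtracting a baseline map from
    a region-agnostic prompt, clipping at zero, and renormalizing.
    (Illustrative guess at PromptDebias; the paper's mechanism may differ.)"""
    debiased = np.clip(attn - alpha * baseline_attn, 0.0, None)
    total = debiased.sum()
    if total == 0:                           # degenerate case: fall back to uniform
        return np.full_like(debiased, 1.0 / debiased.size)
    return debiased / total
```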

Multi-format Visual Prompt Support

A unified interface that converts diverse spatial prompt types (bounding boxes, segmentation masks, scribbles, point clicks) into a common representation for the energy function
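Such a unified interface can be sketched as a converter from each prompt type to a binary mask over the patch grid. The dict schema and the assumption that coordinates are already in patch-grid units are illustrative choices, not the paper's:

```python
import numpy as np

def prompt_to_mask(prompt, grid=(24, 24)):
    """Normalize any visual prompt to a binary mask over the patch grid.
    `prompt` is a dict: {'type': 'box'|'mask'|'point'|'scribble', ...},
    with coordinates in patch-grid units (an assumption of this sketch)."""
    h, w = grid
    mask = np.zeros((h, w), dtype=bool)
    kind = prompt["type"]
    if kind == "box":                        # (x0, y0, x1, y1), inclusive
        x0, y0, x1, y1 = prompt["xyxy"]
        mask[y0:y1 + 1, x0:x1 + 1] = True
    elif kind == "mask":                     # binary mask on the same grid
        mask = prompt["mask"].astype(bool)
    elif kind in ("point", "scribble"):      # list of (x, y) patch coordinates
        for x, y in prompt["points"]:
            mask[y, x] = True
    else:
        raise ValueError(f"unknown prompt type: {kind}")
    return mask
```

Downstream, the energy function only ever sees the mask, which is what makes the four prompt modalities interchangeable at test time.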

Results

| Metric/Benchmark | Baseline (Frozen MLLM) | This Paper (ControlMLLM++) | Delta |
| --- | --- | --- | --- |
| Region-level visual reasoning accuracy | Lower (no spatial control) | Higher (test-time adapted) | Positive improvement |
| Out-of-domain generalization | Poor (task-specific fine-tuning needed) | Strong (training-free) | Qualitative gain |
| Visual prompt type coverage | Single modality (if any) | 4 types: box, mask, scribble, point | Broader coverage |
| Model modification required | Fine-tuning required for grounding | None (frozen weights) | Zero training cost |

Key Takeaways

  • Test-time visual prompt optimization is a practical alternative to fine-tuning for adding spatial grounding capabilities to already-deployed MLLMs, reducing infrastructure and compute costs
  • Cross-modal attention maps in MLLMs are semantically meaningful and can be used as both a diagnostic tool and an optimization signal for region-level control — practitioners should monitor these maps when debugging grounding failures
  • Language prompt biases can actively interfere with visual region control in MLLMs; explicit debiasing (PromptDebias) is a necessary engineering consideration when designing test-time or inference-time adaptation systems

Abstract

We propose ControlMLLM++, a novel test-time adaptation framework that injects learnable visual prompts into frozen multimodal large language models (MLLMs) to enable fine-grained region-based visual reasoning without any model retraining or fine-tuning. Leveraging the insight that cross-modal attention maps intrinsically encode semantic correspondences between textual tokens and visual regions, ControlMLLM++ optimizes a latent visual token modifier during inference via a task-specific energy function to steer model attention towards user-specified areas. To enhance optimization stability and mitigate language prompt biases, ControlMLLM++ incorporates an improved optimization strategy (Optim++) and a prompt debiasing mechanism (PromptDebias). Supporting diverse visual prompt types including bounding boxes, masks, scribbles, and points, our method demonstrates strong out-of-domain generalization and interpretability. The code is available at https://github.com/mrwu-mac/ControlMLLM.

Generated on 2026-03-02 using Claude