Test-Time Computing for Referring Multimodal Large Language Models
Problem Statement
Existing MLLMs struggle with precise region-level visual grounding and referring tasks without expensive fine-tuning or architectural modifications. Training-based approaches require significant compute and risk catastrophic forgetting, while frozen models lack mechanisms to dynamically attend to user-specified spatial regions. There is a need for inference-time methods that can adapt frozen MLLMs to fine-grained spatial reasoning across diverse visual prompt modalities.
Key Novelty
- Test-time optimization of learnable visual token modifiers via a task-specific energy function, enabling region-based control of frozen MLLMs without any parameter updates to the base model
- Optim++: An improved optimization strategy that enhances stability during inference-time gradient-based visual prompt optimization
- PromptDebias: A mechanism to mitigate language prompt biases that can skew attention away from visually-specified regions during test-time adaptation
Evaluation Highlights
- Strong out-of-domain generalization across diverse visual prompt types (bounding boxes, masks, scribbles, points) without task-specific training
- Improved interpretability through cross-modal attention maps that visually validate semantic correspondence between text tokens and specified image regions
Methodology
- Step 1 - Cross-modal attention analysis: Extract attention maps between textual tokens and visual patch tokens from the frozen MLLM to identify semantic correspondences and use them as optimization targets
- Step 2 - Latent visual token modifier optimization: At inference time, optimize a learnable visual token modifier (not model weights) using a task-specific energy function that encourages cross-modal attention to concentrate on user-specified regions (box, mask, scribble, or point)
- Step 3 - Stabilized and debiased inference: Apply Optim++ for numerically stable gradient updates and PromptDebias to counteract language-prior biases, then pass the modified visual tokens through the frozen MLLM for final generation
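The three steps above can be condensed into a minimal test-time optimization loop. This is an illustrative sketch, not the paper's actual code: `energy`, `optimize_modifier`, and the `forward_attn` callback (a frozen-MLLM forward pass that returns a text-to-visual attention map of shape `(T, V)`) are all assumed names, and the simple log-mass energy stands in for the paper's task-specific energy function.

```python
# Minimal sketch of the test-time optimization loop (Steps 1-3).
# Assumes a frozen model exposing cross-modal attention as a (T, V) tensor
# whose rows sum to 1; all names here are illustrative, not the paper's API.
import torch

def energy(attn, region_mask):
    """Task-specific energy (assumed form): reward attention mass in the region.

    attn:        (T, V) cross-modal attention, rows sum to 1
    region_mask: (V,) binary mask over visual tokens, 1 inside the region
    """
    inside = (attn * region_mask).sum(dim=-1)      # in-region mass per text token
    return -(inside.clamp_min(1e-8).log()).mean()  # minimize => maximize in-region mass

def optimize_modifier(visual_tokens, region_mask, forward_attn, steps=10, lr=0.1):
    """Optimize a learnable additive modifier; the base model stays frozen."""
    delta = torch.zeros_like(visual_tokens, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        attn = forward_attn(visual_tokens + delta)  # frozen forward, returns (T, V)
        loss = energy(attn, region_mask)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (visual_tokens + delta).detach()
```

Only `delta` receives gradients, which is what makes the scheme training-free with respect to the base model's weights.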
System Components
- Visual token modifier: A small set of learnable perturbation vectors added to visual token embeddings at test time, optimized per-instance to redirect cross-modal attention toward user-specified spatial regions
- Energy function: An objective that measures alignment between the MLLM's cross-modal attention maps and the user-provided visual prompt region, guiding the optimization of the visual token modifier
- Optim++: An optimization strategy that improves convergence stability and gradient quality during inference-time optimization of the visual prompt
- PromptDebias: A debiasing mechanism that identifies and suppresses language-driven attention biases that could override visual region signals during test-time optimization
- Visual prompt interface: A unified interface that converts diverse spatial prompt types (bounding boxes, segmentation masks, scribbles, point clicks) into a common representation for the energy function
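The unified prompt interface can be pictured as a conversion from any of the four prompt types into one binary mask over the visual-token grid. The sketch below is a hypothetical implementation under assumed conventions (normalized coordinates, a fixed patch grid, `to_region_mask` as an invented name); the paper's actual representation may differ.

```python
# Hypothetical sketch of the unified prompt interface: convert a box, mask,
# scribble, or point prompt into one flat binary mask over an H x W grid of
# visual tokens. Coordinate conventions here are assumptions.
import numpy as np

def to_region_mask(prompt, kind, grid_hw=(24, 24)):
    H, W = grid_hw
    mask = np.zeros((H, W), dtype=np.float32)
    if kind == "mask":                      # already a dense (H, W) mask
        mask = (np.asarray(prompt) > 0).astype(np.float32)
    elif kind == "box":                     # (x0, y0, x1, y1) in [0, 1]
        x0, y0, x1, y1 = prompt
        mask[int(y0 * H):int(np.ceil(y1 * H)),
             int(x0 * W):int(np.ceil(x1 * W))] = 1.0
    elif kind in ("point", "scribble"):     # list of (x, y) in [0, 1]
        for x, y in prompt:
            mask[min(int(y * H), H - 1), min(int(x * W), W - 1)] = 1.0
    else:
        raise ValueError(f"unknown prompt kind: {kind}")
    return mask.reshape(-1)                 # flatten to match visual-token order
```

Once every prompt type lands in this common representation, a single energy function can serve all four modalities unchanged.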
Results
| Metric/Benchmark | Baseline (Frozen MLLM) | This Paper (ControlMLLM++) | Delta |
|---|---|---|---|
| Region-level visual reasoning accuracy | Lower (no spatial control) | Higher (test-time adapted) | Positive improvement |
| Out-of-domain generalization | Poor (task-specific fine-tuning needed) | Strong (training-free) | Qualitative gain |
| Visual prompt type coverage | Single modality (if any) | 4 types: box, mask, scribble, point | Broader coverage |
| Model modification required | Fine-tuning required for grounding | None (frozen weights) | Zero training cost |
Key Takeaways
- Test-time visual prompt optimization is a practical alternative to fine-tuning for adding spatial grounding capabilities to already-deployed MLLMs, reducing infrastructure and compute costs
- Cross-modal attention maps in MLLMs are semantically meaningful and can be used as both a diagnostic tool and an optimization signal for region-level control — practitioners should monitor these maps when debugging grounding failures
- Language prompt biases can actively interfere with visual region control in MLLMs; explicit debiasing (PromptDebias) is a necessary engineering consideration when designing inference-time adaptation systems
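The second takeaway, using attention maps as a diagnostic, can be made concrete with a small check that flags text tokens whose attention largely ignores the specified region. The function name, shapes, and threshold below are assumptions for illustration, not part of the paper's method.

```python
# Sketch of a grounding-failure diagnostic: flag text tokens whose
# cross-modal attention puts too little mass inside the user-specified
# region. Threshold and tensor shapes are assumed for illustration.
import torch

def flag_grounding_failures(attn, region_mask, threshold=0.3):
    """attn: (T, V) attention with rows summing to 1; region_mask: (V,) binary.

    Returns a boolean tensor of shape (T,) marking text tokens whose
    in-region attention mass falls below `threshold`.
    """
    in_region = (attn * region_mask).sum(dim=-1)  # mass inside region, per token
    return in_region < threshold
```

In practice such a check could run before and after test-time optimization to verify that the modifier actually shifted attention into the prompted region.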
Abstract
We propose ControlMLLM++, a novel test-time adaptation framework that injects learnable visual prompts into frozen multimodal large language models (MLLMs) to enable fine-grained region-based visual reasoning without any model retraining or fine-tuning. Leveraging the insight that cross-modal attention maps intrinsically encode semantic correspondences between textual tokens and visual regions, ControlMLLM++ optimizes a latent visual token modifier during inference via a task-specific energy function to steer model attention towards user-specified areas. To enhance optimization stability and mitigate language prompt biases, ControlMLLM++ incorporates an improved optimization strategy (Optim++) and a prompt debiasing mechanism (PromptDebias). Supporting diverse visual prompt types including bounding boxes, masks, scribbles, and points, our method demonstrates strong out-of-domain generalization and interpretability. The code is available at https://github.com/mrwu-mac/ControlMLLM.