When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought

Yiyang Zhou, Haoqin Tu, Zijun Wang, Zeyu Wang, Niklas Muennighoff, Fan Nie, Yejin Choi, James Zou, Chaorui Deng, Shen Yan, Haoqi Fan, Cihang Xie, Huaxiu Yao, Qinghao Ye
arXiv.org | 2025
MIRA is a benchmark that evaluates multimodal models on tasks requiring intermediate visual image generation (sketches, diagrams, path drawings) as part of the reasoning chain, mirroring the human 'drawing to think' cognitive process. It demonstrates that visual chain-of-thought reasoning significantly outperforms purely textual reasoning approaches on spatially and structurally complex problems.

Problem Statement

Current chain-of-thought benchmarks exclusively evaluate text-based intermediate reasoning steps, ignoring scenarios where visual intermediates are cognitively necessary for problem-solving. There has been no principled way to evaluate existing multimodal LLMs on their ability to generate and leverage visual reasoning artifacts. This gap means models are never tested on tasks where language alone is fundamentally insufficient to express critical spatial or structural reasoning steps.

Key Novelty

  • First benchmark (546 problems) specifically designed to require intermediate visual image generation as part of the reasoning process, with annotated visual clues and final answers
  • Three-level unified evaluation protocol: direct input, text-only CoT, and Visual-CoT with image clues plus textual prompts, enabling systematic ablation of visual reasoning contributions
  • Empirical demonstration that visual intermediate cues yield an average 33.7% relative performance gain across all tested models, while expanded search spaces and text-aligned prompts provide only marginal improvements

Evaluation Highlights

  • Visual-CoT input condition yields an average relative gain of 33.7% over text-only prompting across all models and tasks, including strongest private and open-weight multimodal LLMs
  • Pass@k and majority voting metrics probe model upper bounds, revealing that even expanded search spaces and Visual-CoT-aligned textual prompts cannot close the gap with actual visual intermediate cues

Breakthrough Assessment

7/10. MIRA opens a genuinely new evaluation axis for multimodal reasoning by formalizing 'visual chain-of-thought' as a first-class capability, exposing a critical blind spot in current MLLM evaluation and development. While it is a benchmark paper rather than a new model or training method, the conceptual framing and empirical evidence for the necessity of visual intermediates represent a significant advance for the field.

Methodology

  1. Curate 546 multimodal problems intrinsically requiring spatial, structural, or relational reasoning that is difficult to express purely in text, and annotate each with intermediate visual artifacts (sketches, diagrams, path drawings) and ground-truth final answers
  2. Define three evaluation tiers: (1) direct image+question input, (2) text-only CoT with thinking prompts, and (3) Visual-CoT with both annotated image clues and textual thinking prompts, to isolate the contribution of visual intermediates
  3. Evaluate a suite of state-of-the-art private and open-weight multimodal LLMs under all three conditions, and probe upper bounds via pass@k and majority voting under varying k, plus text prompt variants aligned with Visual-CoT structure
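The three evaluation tiers can be sketched as a single dispatch over input conditions. This is an illustrative sketch only: `query_model`, `THINK_PROMPT`, and the problem-dict fields are hypothetical names, not the benchmark's actual interface.

```python
# Sketch of MIRA's three evaluation tiers. `query_model(images, prompt)` is a
# hypothetical wrapper around any multimodal LLM API; the only point is which
# inputs each tier supplies.

THINK_PROMPT = "Think step by step, then give the final answer."

def evaluate(problem: dict, query_model, tier: str) -> str:
    if tier == "direct":
        # Tier 1: image + question only, no reasoning prompt.
        return query_model([problem["image"]], problem["question"])
    if tier == "text_cot":
        # Tier 2: same image, plus a textual thinking prompt.
        return query_model([problem["image"]],
                           problem["question"] + "\n" + THINK_PROMPT)
    if tier == "visual_cot":
        # Tier 3: annotated visual clue added alongside the original image.
        return query_model([problem["image"], problem["visual_clue"]],
                           problem["question"] + "\n" + THINK_PROMPT)
    raise ValueError(f"unknown tier: {tier}")
```

Comparing accuracy across the three tiers isolates the marginal contribution of the textual prompt (tier 2 minus tier 1) and of the visual intermediate itself (tier 3 minus tier 2).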

System Components

MIRA Problem Set

546 curated multimodal problems involving complex spatial relationships, structural diagrams, and path-based reasoning, each requiring intermediate visual generation to solve effectively

Intermediate Visual Annotations

Human-annotated visual artifacts (sketches, structural diagrams, path drawings) paired with each problem, serving as ground-truth visual chain-of-thought steps

Three-Level Evaluation Protocol

Structured evaluation framework spanning direct input, text-only CoT, and Visual-CoT conditions to precisely measure the marginal value of visual intermediate reasoning

Upper-Bound Probing Suite

Pass@k and majority voting evaluations combined with text-aligned Visual-CoT prompt variants, used to characterize the ceiling of model performance under different search strategies
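The two probing metrics are standard and can be computed directly. Below is a minimal sketch using the unbiased pass@k estimator of Chen et al. (2021) and a simple plurality vote; the function names are illustrative, not MIRA's actual tooling.

```python
from collections import Counter
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n total samples of which c are
    correct, is correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def majority_vote(answers: list[str]) -> str:
    """Plurality answer among k sampled completions."""
    return Counter(answers).most_common(1)[0][0]
```

For example, with n = 4 samples of which c = 2 are correct, pass@2 is 1 − C(2,2)/C(4,2) = 5/6.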

Results

| Condition | Text-only CoT | Visual-CoT input | Relative gain |
|---|---|---|---|
| All models & tasks (avg.) | Baseline | Higher accuracy | +33.7% relative |
| Expanded search space (pass@k) | Marginal gain over text CoT | Still below Visual-CoT | Limited improvement |
| Text prompts aligned to Visual-CoT | Marginal gain over text CoT | Still below Visual-CoT | Limited improvement |

Key Takeaways

  • Multimodal LLM developers should invest in training models to generate and consume intermediate visual artifacts (sketches, diagrams) as part of reasoning pipelines, not just text-based chain-of-thought
  • MIRA provides a concrete, quality-controlled benchmark to measure Visual-CoT capability; practitioners evaluating spatial or structural reasoning tasks should adopt its three-level protocol to distinguish text reasoning limits from visual reasoning potential
  • Neither scaling inference compute (pass@k, majority voting) nor carefully engineering text prompts to mimic visual reasoning can substitute for actual visual intermediates, suggesting this is a capability gap requiring architectural or training-level solutions

Abstract

We propose MIRA, a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning. Unlike traditional CoT methods that rely solely on text, tasks in MIRA require models to generate and utilize intermediate images - such as sketches, structural diagrams, or path drawings - to guide their reasoning process. This setup closely mirrors how humans solve complex problems through "drawing to think". To this end, MIRA focuses on tasks that are intrinsically challenging and involve complex structures, spatial relationships, or reasoning steps that are difficult to express through language alone. To ensure that our evaluation data is of high quality, we include 546 multimodal problems, annotated with intermediate visual images and final answers. We also propose a unified evaluation protocol for MIRA that spans three levels of evaluation input: direct input with image and question only, text-only CoT input with image and thinking prompts, and Visual-CoT input with both annotated image clues and textual thinking prompts. To probe the upper bound of model capacity on our benchmark, we also report pass@k and majority voting accuracies under different k settings. Experimental results show that existing multimodal large language models, including the strongest private models as well as strong open-weight models, perform poorly when relying solely on textual prompts. However, when intermediate visual cues are provided, model performance improves consistently, yielding an average relative gain of 33.7% across all models and tasks. We also probe the upper bound by expanding the search space and designing textual prompts aligned with Visual-CoT, but both yield only limited improvements compared to our Visual-CoT setting. These results underscore the critical role of imagined visual information in enabling successful reasoning on MIRA.

Generated on 2026-03-03 using Claude