When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought
Problem Statement
Current chain-of-thought benchmarks evaluate only text-based intermediate reasoning steps, ignoring scenarios where visual intermediates are cognitively necessary for problem-solving. There is no principled way to evaluate existing multimodal LLMs on their ability to generate and leverage visual reasoning artifacts. This gap means models are never tested on tasks where language alone is fundamentally insufficient to express critical spatial or structural reasoning steps.
Key Novelty
- First benchmark (546 problems) specifically designed to require intermediate visual image generation as part of the reasoning process, with annotated visual clues and final answers
- Three-level unified evaluation protocol: direct input, text-only CoT, and Visual-CoT with image clues plus textual prompts, enabling systematic ablation of visual reasoning contributions
- Empirical demonstration that visual intermediate cues yield an average 33.7% relative performance gain across all tested models, while expanded search spaces and text-aligned prompts provide only marginal improvements
Evaluation Highlights
- Visual-CoT input condition yields an average relative gain of 33.7% over text-only prompting across all models and tasks, including the strongest private and open-weight multimodal LLMs
- Pass@k and majority voting metrics probe model upper bounds, revealing that even expanded search spaces and Visual-CoT-aligned textual prompts cannot close the gap with actual visual intermediate cues
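The pass@k and majority-voting probes mentioned above are standard aggregation metrics. As a minimal sketch (function names are illustrative, not from the paper), pass@k can be computed with the commonly used unbiased estimator of Chen et al. (2021), and majority voting simply selects the most frequent answer among sampled generations:

```python
from collections import Counter
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn without replacement from n generations (of which
    c are correct) is correct."""
    if n - c < k:
        return 1.0  # fewer incorrect samples than k, so some draw must hit a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

def majority_vote(answers: list[str]) -> str:
    """Majority-voting aggregation: the most frequent sampled answer wins."""
    return Counter(answers).most_common(1)[0][0]
```

Under this estimator, expanding k enlarges the search space; MIRA's finding is that even these expanded budgets do not close the gap to actual visual intermediate cues.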
Methodology
- Curate 546 multimodal problems intrinsically requiring spatial, structural, or relational reasoning that is difficult to express purely in text, and annotate each with intermediate visual artifacts (sketches, diagrams, path drawings) and ground-truth final answers
- Define three evaluation tiers: (1) direct image+question input, (2) text-only CoT with thinking prompts, and (3) Visual-CoT with both annotated image clues and textual thinking prompts, to isolate the contribution of visual intermediates
- Evaluate a suite of state-of-the-art private and open-weight multimodal LLMs under all three conditions, and probe upper bounds via pass@k and majority voting under varying k, plus text prompt variants aligned with Visual-CoT structure
System Components
- 546 curated multimodal problems involving complex spatial relationships, structural diagrams, and path-based reasoning, each requiring intermediate visual generation to solve effectively
- Human-annotated visual artifacts (sketches, structural diagrams, path drawings) paired with each problem, serving as ground-truth visual chain-of-thought steps
- Structured evaluation framework spanning direct input, text-only CoT, and Visual-CoT conditions to precisely measure the marginal value of visual intermediate reasoning
- Pass@k and majority voting evaluations combined with text-aligned Visual-CoT prompt variants, used to characterize the ceiling of model performance under different search strategies
Results
| Setting | vs. Text-only CoT Baseline | vs. Visual-CoT Input | Takeaway |
|---|---|---|---|
| Visual-CoT input (avg, all models & tasks) | +33.7% relative gain | (reference) | Consistent improvement |
| Expanded search space (pass@k) | Marginal gain | Still below Visual-CoT | Limited improvement |
| Text prompts aligned to Visual-CoT | Marginal gain | Still below Visual-CoT | Limited improvement |
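For clarity, the "+33.7% relative" figure follows the usual relative-gain convention, improvement divided by the baseline score:

```python
def relative_gain(baseline_acc: float, improved_acc: float) -> float:
    """Relative gain of an improved accuracy over a baseline accuracy.
    E.g. going from 0.20 to 0.30 accuracy is a 0.50 (50%) relative gain."""
    return (improved_acc - baseline_acc) / baseline_acc
```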
Key Takeaways
- Multimodal LLM developers should invest in training models to generate and consume intermediate visual artifacts (sketches, diagrams) as part of reasoning pipelines, not just text-based chain-of-thought
- MIRA provides a concrete, quality-controlled benchmark to measure Visual-CoT capability; practitioners evaluating spatial or structural reasoning tasks should adopt its three-level protocol to distinguish text reasoning limits from visual reasoning potential
- Neither scaling inference compute (pass@k, majority voting) nor carefully engineering text prompts to mimic visual reasoning can substitute for actual visual intermediates, suggesting this is a capability gap requiring architectural or training-level solutions
Abstract
We propose MIRA, a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning. Unlike traditional CoT methods that rely solely on text, tasks in MIRA require models to generate and utilize intermediate images, such as sketches, structural diagrams, or path drawings, to guide their reasoning process. This setup closely mirrors how humans solve complex problems through "drawing to think". To this end, MIRA focuses on tasks that are intrinsically challenging and involve complex structures, spatial relationships, or reasoning steps that are difficult to express through language alone. To ensure that our evaluation data is of high quality, we include 546 multimodal problems, annotated with intermediate visual images and final answers. We also propose a unified evaluation protocol for MIRA that spans three levels of evaluation input: direct input with image and question only, text-only CoT input with image and thinking prompts, and Visual-CoT input with both annotated image clues and textual thinking prompts. To probe the upper bound of model capacity on our benchmark, we also report pass@k and majority voting accuracies under different k settings. Experimental results show that existing multimodal large language models, including the strongest private models as well as strong open-weight models, perform poorly when relying solely on textual prompts. However, when intermediate visual cues are provided, model performance improves consistently, yielding an average relative gain of 33.7% across all models and tasks. We also probe the upper bound by expanding the search space and designing textual prompts aligned with Visual-CoT, but both yield only limited improvements compared to our Visual-CoT setting. These results underscore the critical role of imagined visual information in enabling successful reasoning on MIRA.