MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning
Problem Statement
Extending Chain-of-Thought reasoning to multimodal math problems is difficult because existing approaches either treat images coarsely (bounding boxes only) or rely on external tools for visual manipulation. Current vision encoders also struggle with math-specific content like diagrams and equations, and there is no principled way to align reasoning steps with precise visual evidence at the token level.
Key Novelty
- Introduction of the Interleave Token mechanism that dynamically selects and injects arbitrarily shaped visual regions into specific textual reasoning steps at the token level
- Construction of the MINT-CoT dataset with 54K math problems where each reasoning step is aligned with precise visual regions, along with a rigorous data generation pipeline
- A three-stage training strategy combining text-only CoT SFT, interleaved CoT SFT, and interleaved CoT RL to progressively build visual-linguistic reasoning capability
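A step-aligned record in such a dataset might look like the following. This is a hypothetical schema for illustration only; the field names are not taken from the MINT-CoT release:

```python
# Hypothetical record schema for step-aligned visual CoT data
# (field names are illustrative, not from the MINT-CoT release).
record = {
    "question": "In triangle ABC, angle A = 40° and angle B = 60°. Find angle C.",
    "image": "figures/triangle_abc.png",
    "steps": [
        {
            "text": "Identify the three marked angles in the figure.",
            # Indices of selected vision-encoder patch tokens; the set can
            # form an arbitrary shape, not necessarily a rectangle.
            "visual_token_ids": [14, 15, 22, 23, 31],
        },
        {
            "text": "Angles of a triangle sum to 180°, so angle C = 180° - 40° - 60° = 80°.",
            "visual_token_ids": [],  # purely symbolic step, no visual evidence
        },
    ],
    "answer": "80°",
}

print(record["answer"])  # → 80°
```

The key property is that visual grounding lives at the step level: each reasoning step carries its own (possibly empty) set of visual token indices rather than one crop for the whole problem.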
Evaluation Highlights
- MINT-CoT-7B achieves +34.08% improvement over baseline on MathVista and +28.78% on GeoQA, demonstrating strong gains on geometry and general math visual benchmarks
- MINT-CoT-7B improves by +23.2% on MMStar, indicating broader multimodal reasoning gains beyond purely mathematical tasks
Breakthrough Assessment

Methodology
- Stage 1 — Text-only CoT SFT: Fine-tune the base model on text-only mathematical chain-of-thought data to establish strong symbolic reasoning priors before introducing visual signals
- Stage 2 — Interleaved CoT SFT: Fine-tune on the 54K MINT-CoT dataset where each reasoning step is paired with token-level visual region annotations, teaching the model to emit Interleave Tokens that retrieve relevant visual patches of arbitrary shapes
- Stage 3 — Interleaved CoT RL: Apply reinforcement learning on interleaved CoT trajectories to further optimize reasoning quality and visual selection accuracy, aligning the model toward correct final answers
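The three stages above can be sketched as a training schedule. This is a minimal sketch with stub `sft`/`rl` helpers; the function names and loss/reward labels are placeholders, not the paper's actual API:

```python
# Minimal sketch of the three-stage schedule. `sft` and `rl` are stub
# helpers; loss/reward labels are placeholders, not the paper's API.
def sft(model, data, loss):
    model["stages"].append(("sft", loss))   # stand-in for a real SFT pass
    return model

def rl(model, data, reward):
    model["stages"].append(("rl", reward))  # stand-in for a real RL pass
    return model

def train_mint_cot(model, text_cot, interleaved_cot):
    # Stage 1: text-only CoT SFT builds symbolic reasoning priors.
    model = sft(model, text_cot, loss="next_token")
    # Stage 2: interleaved CoT SFT teaches the model to emit Interleave
    # Tokens and supervises which visual tokens each step selects.
    model = sft(model, interleaved_cot, loss="next_token+selection")
    # Stage 3: RL on interleaved trajectories rewards correct final
    # answers and accurate visual selection.
    model = rl(model, interleaved_cot, reward="answer+selection")
    return model

model = train_mint_cot({"stages": []}, text_cot=[], interleaved_cot=[])
print([stage for stage, _ in model["stages"]])  # → ['sft', 'sft', 'rl']
```

The ordering matters: visual supervision is introduced only after text-only reasoning is established, and RL refines a model that already emits well-formed interleaved trajectories.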
System Components
- Interleave Token: a special token inserted at reasoning steps that dynamically selects and injects visually relevant token regions of arbitrary shapes from the input math figure into the reasoning chain
- MINT-CoT dataset: 54K mathematical problems with step-level alignment between textual reasoning steps and precise visual regions, constructed through an automated data generation pipeline
- Three-stage training strategy: progressive pipeline (text-only CoT SFT → interleaved CoT SFT → interleaved CoT RL) designed to build reasoning and visual grounding capabilities incrementally
- Flexible region selection: mechanism enabling free-form, non-rectangular visual region selection at the token level, overcoming the limitation of box-shaped crops used in prior work
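One plausible way to realize token-level, free-form selection is to score every visual patch token against the Interleave Token's hidden state and keep those above a threshold. The sketch below assumes this similarity-and-threshold formulation; the dimensions, threshold, and data are illustrative, not the paper's values:

```python
import numpy as np

# Sketch of token-level visual selection: compare an Interleave Token
# state against every visual patch embedding (all values illustrative).
rng = np.random.default_rng(0)
d = 16                                   # embedding dimension (assumed)
patches = rng.normal(size=(8 * 8, d))    # 8x8 grid of visual patch tokens
# Stand-in for the Interleave Token's hidden state: the mean of a few
# patches it should attend to.
query = patches[[10, 11, 18, 27]].mean(axis=0)

def select_patches(query, patches, thresh=0.3):
    # Cosine similarity between the query state and each patch token.
    q = query / np.linalg.norm(query)
    p = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    sims = p @ q
    # Any subset of patch indices may be chosen, so the selected region
    # is free-form rather than constrained to a bounding box.
    return np.flatnonzero(sims > thresh)

selected = select_patches(query, patches)
print(sorted(selected.tolist()))
```

Because selection operates on individual patch indices, the chosen region can trace a diagonal line or an angle mark in a diagram, which a rectangular crop cannot.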
Results
| Benchmark | Relative improvement of MINT-CoT-7B over baseline 7B |
|---|---|
| MathVista | +34.08% |
| GeoQA | +28.78% |
| MMStar | +23.2% |
Key Takeaways
- Token-level visual interleaving in reasoning chains is a more effective paradigm than post-hoc image cropping or single-image conditioning for math VQA — practitioners building math tutoring or diagram-reasoning systems should adopt step-aligned visual grounding
- Progressive training (text CoT → visual CoT SFT → visual CoT RL) is a practical recipe for injecting multimodal reasoning into existing LLMs without catastrophic forgetting of language reasoning ability
- The 54K MINT-CoT dataset and open-source code provide a ready-made foundation for researchers to fine-tune smaller multimodal models on math reasoning tasks with visual evidence alignment, reducing the need to build new datasets from scratch
Abstract
Chain-of-Thought (CoT) has widely enhanced mathematical reasoning in Large Language Models (LLMs), but extending it to multimodal domains remains challenging. Existing works either apply similar textual reasoning to image inputs, or seek to interleave visual signals into mathematical CoT. However, they face three key limitations for math problem-solving: reliance on coarse-grained box-shaped image regions, limited perception of vision encoders on math content, and dependence on external capabilities for visual modification. In this paper, we propose MINT-CoT, introducing Mathematical INterleaved Tokens for Chain-of-Thought visual reasoning. MINT-CoT adaptively interleaves relevant visual tokens into textual reasoning steps via an Interleave Token, which dynamically selects visual regions of any shape within math figures. To empower this capability, we construct the MINT-CoT dataset, containing 54K mathematical problems that align each reasoning step with visual regions at the token level, accompanied by a rigorous data generation pipeline. We further present a three-stage MINT-CoT training strategy, progressively combining text-only CoT SFT, interleaved CoT SFT, and interleaved CoT RL, which yields our MINT-CoT-7B model. Extensive experiments demonstrate the effectiveness of our method for visual interleaved reasoning in mathematical domains, where MINT-CoT-7B outperforms the baseline model by +34.08% on MathVista, +28.78% on GeoQA, and +23.2% on MMStar. Our code and data are available at https://github.com/xinyan-cxy/MINT-CoT