MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning
Problem Statement
Extending Chain-of-Thought reasoning to multimodal math problems is difficult because existing approaches either treat images coarsely (bounding boxes only) or rely on external tools for visual manipulation. Current vision encoders also struggle with math-specific content like diagrams and equations, and there is no principled way to align reasoning steps with precise visual evidence at the token level.
Key Novelty
- Introduction of the Interleave Token mechanism that dynamically selects and injects arbitrarily shaped visual regions into specific textual reasoning steps at the token level
- Construction of the MINT-CoT dataset with 54K math problems where each reasoning step is aligned with precise visual regions, along with a rigorous data generation pipeline
- A three-stage training strategy combining text-only CoT SFT, interleaved CoT SFT, and interleaved CoT RL to progressively build visual-linguistic reasoning capability
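A step-aligned record in such a dataset might look like the following. This is a hypothetical schema for illustration only; the field names are not taken from the MINT-CoT release:

```python
# Hypothetical record schema for step-aligned visual CoT data
# (field names are illustrative, not from the MINT-CoT release).
record = {
    "question": "In triangle ABC, angle A = 40° and angle B = 60°. Find angle C.",
    "image": "figures/triangle_abc.png",
    "steps": [
        {
            "text": "Identify the three marked angles in the figure.",
            # Indices of selected vision-encoder patch tokens; the set can
            # form an arbitrary shape, not necessarily a rectangle.
            "visual_token_ids": [14, 15, 22, 23, 31],
        },
        {
            "text": "Angles of a triangle sum to 180°, so angle C = 180° - 40° - 60° = 80°.",
            "visual_token_ids": [],  # purely symbolic step, no visual evidence
        },
    ],
    "answer": "80°",
}

print(record["answer"])  # → 80°
```

The key property is that visual grounding lives at the step level: each reasoning step carries its own (possibly empty) set of visual token indices rather than one crop for the whole problem.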
Evaluation Highlights
- MINT-CoT-7B achieves +34.08% improvement over baseline on MathVista and +28.78% on GeoQA, demonstrating strong gains on geometry and general math visual benchmarks
- MINT-CoT-7B improves by +23.2% on MMStar, indicating broader multimodal reasoning gains beyond purely mathematical tasks
Breakthrough Assessment

Methodology
- Stage 1 — Text-only CoT SFT: Fine-tune the base model on text-only mathematical chain-of-thought data to establish strong symbolic reasoning priors before introducing visual signals
- Stage 2 — Interleaved CoT SFT: Fine-tune on the 54K MINT-CoT dataset where each reasoning step is paired with token-level visual region annotations, teaching the model to emit Interleave Tokens that retrieve relevant visual patches of arbitrary shapes
- Stage 3 — Interleaved CoT RL: Apply reinforcement learning on interleaved CoT trajectories to further optimize reasoning quality and visual selection accuracy, aligning the model toward correct final answers
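The three stages above can be sketched as a training schedule. This is a minimal sketch with stub `sft`/`rl` helpers; the function names and loss/reward labels are placeholders, not the paper's actual API:

```python
# Minimal sketch of the three-stage schedule. `sft` and `rl` are stub
# helpers; loss/reward labels are placeholders, not the paper's API.
def sft(model, data, loss):
    model["stages"].append(("sft", loss))   # stand-in for a real SFT pass
    return model

def rl(model, data, reward):
    model["stages"].append(("rl", reward))  # stand-in for a real RL pass
    return model

def train_mint_cot(model, text_cot, interleaved_cot):
    # Stage 1: text-only CoT SFT builds symbolic reasoning priors.
    model = sft(model, text_cot, loss="next_token")
    # Stage 2: interleaved CoT SFT teaches the model to emit Interleave
    # Tokens and supervises which visual tokens each step selects.
    model = sft(model, interleaved_cot, loss="next_token+selection")
    # Stage 3: RL on interleaved trajectories rewards correct final
    # answers and accurate visual selection.
    model = rl(model, interleaved_cot, reward="answer+selection")
    return model

model = train_mint_cot({"stages": []}, text_cot=[], interleaved_cot=[])
print([stage for stage, _ in model["stages"]])  # → ['sft', 'sft', 'rl']
```

The ordering matters: visual supervision is introduced only after text-only reasoning is established, and RL refines a model that already emits well-formed interleaved trajectories.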
System Components
- Interleave Token: a special token inserted at reasoning steps that dynamically selects and injects visually relevant token regions of arbitrary shapes from the input math figure into the reasoning chain
- MINT-CoT dataset: 54K mathematical problems with step-level alignment between textual reasoning steps and precise visual regions, constructed through an automated data generation pipeline
- Three-stage training strategy: progressive pipeline (text-only CoT SFT → interleaved CoT SFT → interleaved CoT RL) designed to build reasoning and visual grounding capabilities incrementally
- Flexible region selection: mechanism enabling free-form, non-rectangular visual region selection at the token level, overcoming the limitation of box-shaped crops used in prior work
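One plausible way to realize token-level, free-form selection is to score every visual patch token against the Interleave Token's hidden state and keep those above a threshold. The sketch below assumes this similarity-and-threshold formulation; the dimensions, threshold, and data are illustrative, not the paper's values:

```python
import numpy as np

# Sketch of token-level visual selection: compare an Interleave Token
# state against every visual patch embedding (all values illustrative).
rng = np.random.default_rng(0)
d = 16                                   # embedding dimension (assumed)
patches = rng.normal(size=(8 * 8, d))    # 8x8 grid of visual patch tokens
# Stand-in for the Interleave Token's hidden state: the mean of a few
# patches it should attend to.
query = patches[[10, 11, 18, 27]].mean(axis=0)

def select_patches(query, patches, thresh=0.3):
    # Cosine similarity between the query state and each patch token.
    q = query / np.linalg.norm(query)
    p = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    sims = p @ q
    # Any subset of patch indices may be chosen, so the selected region
    # is free-form rather than constrained to a bounding box.
    return np.flatnonzero(sims > thresh)

selected = select_patches(query, patches)
print(sorted(selected.tolist()))
```

Because selection operates on individual patch indices, the chosen region can trace a diagonal line or an angle mark in a diagram, which a rectangular crop cannot.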
Results
| Benchmark | Relative improvement of MINT-CoT-7B over baseline 7B |
|---|---|
| MathVista | +34.08% |
| GeoQA | +28.78% |
| MMStar | +23.2% |
Key Takeaways
- Token-level visual interleaving in reasoning chains is a more effective paradigm than post-hoc image cropping or single-image conditioning for math VQA — practitioners building math tutoring or diagram-reasoning systems should adopt step-aligned visual grounding
- Progressive training (text CoT → visual CoT SFT → visual CoT RL) is a practical recipe for injecting multimodal reasoning into existing LLMs without catastrophic forgetting of language reasoning ability
- The 54K MINT-CoT dataset and open-source code provide a ready-made foundation for researchers to fine-tune smaller multimodal models on math reasoning tasks with visual evidence alignment, reducing the need to build new datasets from scratch
Abstract
Chain-of-Thought (CoT) has widely enhanced mathematical reasoning in Large Language Models (LLMs), but extending it to multimodal domains remains challenging. Existing works either apply similar textual reasoning to image inputs, or seek to interleave visual signals into mathematical CoT. However, they face three key limitations for math problem-solving: reliance on coarse-grained box-shaped image regions, limited perception of vision encoders on math content, and dependence on external capabilities for visual modification. In this paper, we propose MINT-CoT, introducing Mathematical INterleaved Tokens for Chain-of-Thought visual reasoning. MINT-CoT adaptively interleaves relevant visual tokens into textual reasoning steps via an Interleave Token, which dynamically selects visual regions of any shape within math figures. To empower this capability, we construct the MINT-CoT dataset, containing 54K mathematical problems that align each reasoning step with visual regions at the token level, accompanied by a rigorous data generation pipeline. We further present a three-stage MINT-CoT training strategy, progressively combining text-only CoT SFT, interleaved CoT SFT, and interleaved CoT RL, which yields our MINT-CoT-7B model. Extensive experiments demonstrate the effectiveness of our method for visual interleaved reasoning in mathematical domains, where MINT-CoT-7B outperforms the baseline model by +34.08% on MathVista, +28.78% on GeoQA, and +23.2% on MMStar. Our code and data are available at https://github.com/xinyan-cxy/MINT-CoT