
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

Yaoting Wang, Shengqiong Wu, Yuecheng Zhang, Shuicheng Yan, Ziwei Liu, Jiebo Luo, Hao Fei
arXiv.org | 2025
This paper presents the first systematic survey of Multimodal Chain-of-Thought (MCoT) reasoning, cataloging methodologies, taxonomies, and applications across image, video, speech, audio, 3D, and structured-data modalities. It aims to unify the fragmented landscape of MCoT research and guide future progress toward multimodal AGI.

Problem Statement

Chain-of-thought reasoning has proven powerful in LLMs, but extending it to multimodal contexts introduces distinct challenges for each heterogeneous data type, and the field lacks a unified framework for addressing them. Existing MCoT studies are scattered across domains without a consolidated review, making it difficult to understand the state of the art, identify gaps, or build systematically on prior work. Without an up-to-date survey, practitioners lack a roadmap for applying MCoT techniques to real-world multimodal applications.

Key Novelty

  • First systematic and comprehensive survey dedicated specifically to Multimodal Chain-of-Thought (MCoT) reasoning, covering all major modalities and integration with MLLMs
  • A comprehensive taxonomy of MCoT methodologies organized from diverse perspectives including modality type, reasoning paradigm, and application domain
  • Identification and synthesis of open challenges and future research directions for MCoT toward multimodal AGI, including robotics, healthcare, autonomous driving, and generation

Evaluation Highlights

  • Qualitative coverage of MCoT success across six modality types: image, video, speech, audio, 3D, and structured data
  • Qualitative assessment of MCoT application breadth spanning robotics, healthcare, autonomous driving, and multimodal generation domains

Breakthrough Assessment

5/10. As a survey, this is a solid and timely contribution that consolidates a rapidly evolving field and provides a structured taxonomy, but it introduces no novel algorithms or empirical advances of its own. Its value lies in organization and synthesis rather than a technical breakthrough.

Methodology

  1. Define foundational concepts and terminology of CoT reasoning and its extension to multimodal contexts, establishing a common vocabulary for MCoT
  2. Construct a comprehensive taxonomy categorizing existing MCoT methods by modality (image, video, speech, audio, 3D, structured data), methodology type, and application scenario
  3. Analyze current challenges, open problems, and future research directions to guide practitioners and researchers toward impactful contributions in multimodal AGI

System Components

Foundational Concepts & Definitions

Establishes unified terminology for CoT and MCoT reasoning, clarifying how step-by-step reasoning extends from text-only to multimodal settings
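
To make the extension concrete, here is a minimal Python sketch of how a text-only CoT instruction ("reason step by step") generalizes to a multimodal request. The chat-style message format and the `call_mllm` function are illustrative assumptions, not any specific vendor API:

```python
# Minimal sketch, not a real library API: `call_mllm` is a hypothetical
# stand-in for whatever MLLM inference function is available.

def build_mcot_prompt(question: str, image_path: str) -> list:
    """Pair an image with a step-by-step instruction, chat-style."""
    return [
        {"role": "system",
         "content": ("Reason step by step, grounding each step in the image, "
                     "then state the final answer on its own line.")},
        {"role": "user",
         "content": [
             {"type": "image", "path": image_path},  # visual evidence
             {"type": "text", "text": question},     # the textual query
         ]},
    ]

# Hypothetical usage:
# messages = build_mcot_prompt("How many people are crossing?", "scene.jpg")
# output = call_mllm(messages)  # rationale followed by the final answer
```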

Comprehensive Taxonomy

Organizes MCoT methods across modalities (image, video, speech, audio, 3D, structured data) and paradigms, enabling structured comparison and navigation of the literature
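
As a rough illustration of how such a taxonomy supports navigation, the sketch below renders it as a simple record type that can be filtered along any axis. The field values and paradigm labels are examples, not the survey's exact categories:

```python
# Illustrative sketch only: entries and labels below are invented examples.
from dataclasses import dataclass

@dataclass
class MCoTEntry:
    method: str       # paper or system name
    modality: str     # image | video | speech | audio | 3d | structured
    paradigm: str     # reasoning paradigm, e.g. prompting vs. tool-augmented
    application: str  # e.g. robotics, healthcare, autonomous driving

catalog = [
    MCoTEntry("example-visual-cot", "image", "prompting", "visual QA"),
    MCoTEntry("example-temporal-cot", "video", "prompting", "video QA"),
]

# Navigate the literature along any taxonomy axis:
video_methods = [e for e in catalog if e.modality == "video"]
```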

Application Scenario Analysis

Reviews MCoT applications in robotics, healthcare, autonomous driving, and multimodal generation, mapping methods to real-world use cases

Challenges & Future Directions

Identifies unresolved problems and opportunities in MCoT research to foster targeted innovation toward multimodal AGI

Results

| Domain/Modality | Pre-MCoT Status | MCoT Achievement | Impact |
|---|---|---|---|
| Image Reasoning | Single-step visual QA | Step-by-step visual CoT with MLLMs | Improved interpretability & accuracy |
| Video Reasoning | Frame-level understanding | Temporal CoT reasoning chains | Better temporal coherence |
| Robotics | Reactive control | MCoT-guided planning & manipulation | Enhanced task generalization |
| Healthcare | Single-modality diagnosis | Multi-step multimodal clinical reasoning | More explainable decisions |
| Autonomous Driving | Perception-only pipelines | Reasoning chains over sensor fusion | Improved safety reasoning |

Key Takeaways

  • ML practitioners building multimodal reasoning systems should leverage MCoT frameworks to decompose complex cross-modal tasks into interpretable step-by-step chains, improving both performance and explainability (see the sketch after this list)
  • The taxonomy provided in this survey offers a practical reference for selecting appropriate MCoT methodologies based on modality type and application domain, reducing redundant exploration
  • Identified open challenges—such as cross-modal alignment in reasoning chains, hallucination in MLLMs, and scalable supervision for MCoT—represent high-impact research directions for NLP and CV practitioners working on next-generation multimodal systems
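
The sketch below shows one way such a decomposition can look in practice: evidence extraction, reasoning, and answering as separate, inspectable steps. The `mllm` callable and its keyword arguments are assumptions for illustration, not a real interface:

```python
# Minimal sketch, assuming `mllm` is any callable MLLM interface that accepts
# an optional image and a text prompt; names are illustrative, not a real API.

def answer_with_mcot(mllm, image, question: str) -> dict:
    # Step 1: ground the question in visual evidence.
    evidence = mllm(image=image,
                    prompt=f"List the visual facts relevant to: {question}")
    # Step 2: reason over the extracted facts in text space.
    rationale = mllm(prompt=f"Facts: {evidence}\n"
                            f"Reason step by step about: {question}")
    # Step 3: commit to a final answer conditioned on the rationale.
    answer = mllm(prompt=f"Rationale: {rationale}\nGive the final answer only.")
    # Returning intermediate steps keeps the chain auditable.
    return {"evidence": evidence, "rationale": rationale, "answer": answer}
```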

Abstract

By extending the advantage of chain-of-thought (CoT) reasoning in human-like step-by-step processes to multimodal contexts, multimodal CoT (MCoT) reasoning has recently garnered significant research attention, especially in the integration with multimodal large language models (MLLMs). Existing MCoT studies design various methodologies and innovative reasoning paradigms to address the unique challenges of image, video, speech, audio, 3D, and structured data across different modalities, achieving extensive success in applications such as robotics, healthcare, autonomous driving, and multimodal generation. However, MCoT still presents distinct challenges and opportunities that require further focus to ensure consistent thriving in this field, where, unfortunately, an up-to-date review of this domain is lacking. To bridge this gap, we present the first systematic survey of MCoT reasoning, elucidating the relevant foundational concepts and definitions. We offer a comprehensive taxonomy and an in-depth analysis of current methodologies from diverse perspectives across various application scenarios. Furthermore, we provide insights into existing challenges and future research directions, aiming to foster innovation toward multimodal AGI.
