
VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning

Yukun Qi, Yiming Zhao, Yu Zeng, Xikun Bao, Wenxuan Huang, Lin Chen, Zehui Chen, Jie Zhao, Zhongang Qi, Feng Zhao
arXiv.org | 2025
VCR-Bench is a comprehensive benchmark for evaluating Video Chain-of-Thought reasoning in large vision-language models, introducing stepwise CoT rationales tagged by perception vs. reasoning capability to diagnose failure modes.

Problem Statement

Existing video benchmarks only measure final answer accuracy without assessing the quality of the reasoning process, making it impossible to distinguish whether model failures stem from perceptual deficits or reasoning deficiencies. There is no standardized framework to evaluate CoT reasoning specifically in the video domain, leaving a critical gap in understanding LVLM capabilities for complex temporal-spatial tasks.

Key Novelty

  • Introduction of VCR-Bench with 859 videos and 1,034 QA pairs, each manually annotated with stepwise CoT rationales where every step is tagged as either perception-related or reasoning-related
  • A novel CoT Score metric that evaluates the entire chain-of-thought process step-by-step rather than only final answer correctness, enabling fine-grained process-level assessment
  • Seven distinct task dimensions that span diverse video content and durations, enabling systematic decomposition of LVLM performance across different reasoning categories

Evaluation Highlights

  • Top-performing model (o1) achieves only 62.8% CoT Score and 56.7% accuracy, while most models score below 40%, revealing substantial headroom for improvement
  • Most models score lower on perception steps than reasoning steps, identifying temporal-spatial information processing as the primary bottleneck in video CoT reasoning; a strong positive correlation between CoT Score and accuracy validates the metric
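The reported positive correlation between CoT Score and accuracy can be checked with a plain Pearson correlation over per-model (CoT score, accuracy) pairs. The sketch below uses invented per-model numbers for illustration; they are not figures from the paper:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    std_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (std_x * std_y)

# Hypothetical per-model (CoT score, accuracy) pairs -- illustrative only
cot_scores = [30.0, 45.0, 62.8]
accuracies = [28.0, 41.0, 56.7]
r = pearson(cot_scores, accuracies)  # strongly positive for these inputs
```

A value of r near 1 across many models is what the paper's validation argument rests on: if grading the process tracks grading the outcome, the process metric is measuring something real.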

Breakthrough Assessment

6/10. VCR-Bench makes a solid, well-scoped contribution by introducing principled process-level evaluation for video reasoning and surfacing the perception-vs-reasoning bottleneck, but it is primarily a benchmark/evaluation paper rather than a methodological or architectural advance.

Methodology

  1. Curate 859 videos across diverse content types and durations, then construct 1,034 high-quality QA pairs covering seven task dimensions relevant to complex video understanding
  2. Manually annotate each QA pair with a stepwise CoT rationale, tagging every individual step as either a perception step (extracting visual/temporal information) or a reasoning step (logical inference from extracted information)
  3. Evaluate LVLMs by computing both a final accuracy score and the novel CoT Score—which grades model-generated reasoning chains against the stepwise tagged ground-truth rationales—then analyze per-tag performance to diagnose model bottlenecks
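The paper does not spell out its exact scoring rubric here, but step 3 can be sketched as a recall-style scorer over the tagged reference steps. The per-step coverage judgments (in practice produced by a judge model) are taken as inputs, and all names and example texts below are illustrative:

```python
from collections import defaultdict

def cot_score(reference_steps, covered):
    """Recall-style CoT score: fraction of tagged reference steps that the
    model's reasoning chain covers, overall and broken down by tag.

    reference_steps: list of (tag, text) pairs, tag in {"perception", "reasoning"}
    covered: parallel list of bools (per-step match judgments, e.g. from an
             LLM judge); supplied as input here to keep the sketch self-contained
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for (tag, _), ok in zip(reference_steps, covered):
        totals[tag] += 1
        hits[tag] += int(ok)
    per_tag = {tag: hits[tag] / totals[tag] for tag in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_tag

# Toy rationale for one QA pair (step texts are invented for illustration)
steps = [("perception", "Player in red picks up the ball at 0:12"),
         ("perception", "Scoreboard reads 2-1 at 0:30"),
         ("reasoning",  "Hence the red team took the lead after the goal")]
overall, per_tag = cot_score(steps, covered=[True, False, True])
```

Keeping hits and totals keyed by tag is what enables the per-tag diagnosis: the same pass that produces the overall CoT score also yields separate perception and reasoning scores.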

System Components

VCR-Bench Dataset

859 videos with 1,034 QA pairs spanning diverse content and durations, designed to stress-test complex video reasoning

Stepwise CoT Rationale Annotation

Manual annotation of each QA pair with a multi-step reasoning chain where every step is labeled as either a perception step or a reasoning step

CoT Score

A process-level evaluation metric that scores the quality of a model's full chain-of-thought by comparing it against the stepwise tagged ground-truth rationale, distinct from simple final-answer accuracy

Seven Task Dimensions

A taxonomy of distinct video reasoning task types used to evaluate LVLM performance across varied cognitive and perceptual demands

Perception vs. Reasoning Diagnostic Split

A per-tag analysis that separates model performance on perception-dependent steps from reasoning-dependent steps to pinpoint specific capability gaps
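A minimal sketch of such a per-tag diagnostic, assuming each QA pair already has per-tag step scores; the function name and the data values are hypothetical:

```python
def diagnose_bottleneck(per_question_scores):
    """Average per-tag step scores across the benchmark and report the
    weakest tag. per_question_scores: list of dicts such as
    {"perception": 0.4, "reasoning": 0.8}, one dict per QA pair."""
    sums, counts = {}, {}
    for scores in per_question_scores:
        for tag, value in scores.items():
            sums[tag] = sums.get(tag, 0.0) + value
            counts[tag] = counts.get(tag, 0) + 1
    means = {tag: sums[tag] / counts[tag] for tag in sums}
    return means, min(means, key=means.get)

# Invented per-question tag scores, mirroring the paper's finding that
# perception steps score lower than reasoning steps
means, weakest = diagnose_bottleneck([
    {"perception": 0.3, "reasoning": 0.7},
    {"perception": 0.5, "reasoning": 0.6},
])
```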

Results

| Model / step type       | CoT Score | Accuracy | Notes                               |
|-------------------------|-----------|----------|-------------------------------------|
| o1 (top performer)      | 62.8%     | 56.7%    | Best on benchmark                   |
| Most other LVLMs        | <40%      | <40%     | Substantial gap to top model        |
| Perception steps (avg.) | Lower     | n/a      | Primary bottleneck identified       |
| Reasoning steps (avg.)  | Higher    | n/a      | Relatively stronger than perception |

Key Takeaways

  • Perception of temporal-spatial information—not high-level reasoning—is the primary bottleneck for current LVLMs on complex video tasks, so practitioners should prioritize improving video frame sampling, temporal grounding, and spatial attention mechanisms
  • Evaluating only final-answer accuracy is insufficient for video reasoning; process-level metrics like CoT Score provide more actionable diagnostics and better correlate with true model capability
  • Even the best available model (o1) leaves ~37% CoT Score headroom on VCR-Bench, indicating that video chain-of-thought reasoning remains a far-from-solved problem and a productive research frontier

Abstract

The advancement of Chain-of-Thought (CoT) reasoning has significantly enhanced the capabilities of large language models (LLMs) and large vision-language models (LVLMs). However, a rigorous evaluation framework for video CoT reasoning remains absent. Current video benchmarks fail to adequately assess the reasoning process and expose whether failures stem from deficiencies in perception or reasoning capabilities. Therefore, we introduce VCR-Bench, a novel benchmark designed to comprehensively evaluate LVLMs' Video Chain-of-Thought Reasoning capabilities. VCR-Bench comprises 859 videos spanning a variety of video content and durations, along with 1,034 high-quality question-answer pairs. Each pair is manually annotated with a stepwise CoT rationale, where every step is tagged to indicate its association with the perception or reasoning capabilities. Furthermore, we design seven distinct task dimensions and propose the CoT score to assess the entire CoT process based on the stepwise tagged CoT rationales. Extensive experiments on VCR-Bench highlight substantial limitations in current LVLMs. Even the top-performing model, o1, only achieves a 62.8% CoT score and a 56.7% accuracy, while most models score below 40%. Experiments show most models score lower on perception than reasoning steps, revealing LVLMs' key bottleneck in temporal-spatial information processing for complex video reasoning. A robust positive correlation between the CoT score and accuracy confirms the validity of our evaluation framework and underscores the critical role of CoT reasoning in solving complex video reasoning tasks. We hope VCR-Bench will serve as a standardized evaluation framework and expose the actual drawbacks in complex video reasoning tasks.

Generated on 2026-03-03 using Claude