
GeoChain: Multimodal Chain-of-Thought for Geographic Reasoning

Sahiti Yerramilli, Nilay Pande, Rynaa Grover, Jayant Sravan Tamarapalli
Conference on Empirical Methods in Natural Language Processing | 2025
GeoChain introduces a large-scale multimodal benchmark with 1.46M street-level images paired with 21-step chain-of-thought question sequences to systematically evaluate step-by-step geographic reasoning in MLLMs. It diagnoses specific failure modes across visual, spatial, cultural, and precise geolocation reasoning categories.

Problem Statement

Existing benchmarks for MLLMs lack structured, step-by-step evaluation of geographic reasoning, making it difficult to pinpoint where models fail in the localization pipeline. Geographic reasoning requires integrating visual grounding, spatial understanding, and cultural knowledge in a progressive manner that flat Q&A benchmarks cannot capture. There is no large-scale diagnostic tool that annotates difficulty, provides semantic segmentation context, and measures fine-grained localization performance across reasoning complexity levels.

Key Novelty

  • Massive-scale benchmark: 1.46M Mapillary street-level images paired with 21-step CoT question sequences yielding over 30 million Q&A pairs, far exceeding prior geo-reasoning datasets in scale and structure
  • Four-category reasoning taxonomy (visual, spatial, cultural, precise geolocation) with difficulty annotations and a novel 'visual locatability score' to quantify how identifiable an image is
  • Semantic segmentation enrichment with 150 classes per image, enabling fine-grained analysis of the relationship between scene content and model reasoning performance

Evaluation Highlights

  • Benchmarking of GPT-4.1 variants, Claude 3.7, and Gemini 2.5 variants on a 2,088-image diverse subset reveals consistent weaknesses in visual grounding and erratic reasoning chains, especially at higher CoT steps
  • All evaluated state-of-the-art MLLMs struggle with precise geolocation accuracy as reasoning complexity escalates, indicating a systematic gap between coarse attribute recognition and fine-grained localization

Breakthrough Assessment

6/10. GeoChain is a solid, well-scoped benchmark contribution that fills a clear gap in structured geographic-reasoning evaluation for MLLMs. It is primarily a diagnostic dataset paper rather than a new modeling advance, so its impact depends on community adoption to drive future model improvements.

Methodology

  1. Curate 1.46M diverse street-level images from Mapillary, enriching each with semantic segmentation (150 classes) and computing a visual locatability score reflecting how geographically identifiable each image is
  2. Generate 21-step chain-of-thought Q&A sequences per image using a structured taxonomy progressing from coarse visual attributes to fine-grained geolocation, annotated by difficulty and reasoning category
  3. Benchmark frontier MLLMs (GPT-4.1, Claude 3.7, Gemini 2.5) on a representative 2,088-image subset, analyzing performance degradation across reasoning steps, categories, and difficulty levels

System Components

CoT Question Sequences

21-step progressively complex Q&A chains per image guiding models from broad visual observations to precise geographic localization
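The chain structure described above can be sketched as a simple record type plus a step-wise scorer. This is a minimal illustration, not the paper's actual schema: the field names, category strings, and helper function below are assumptions.

```python
from dataclasses import dataclass

# Hypothetical schema for one step of a GeoChain question sequence.
# Field names and category labels are illustrative assumptions.
@dataclass
class CoTStep:
    index: int        # 1..21; coarse visual observations first, precise geolocation last
    category: str     # "visual" | "spatial" | "cultural" | "precise_geolocation"
    difficulty: str   # per-question difficulty annotation, e.g. "easy" | "hard"
    question: str
    answer: str

def per_step_correct(predictions: list[str], chain: list[CoTStep]) -> list[bool]:
    """Compare model answers to gold answers step by step, exposing
    where in the 21-step chain the reasoning starts to break down."""
    return [pred == step.answer for pred, step in zip(predictions, chain)]
```

Aggregating `per_step_correct` over many images yields per-step accuracy curves, which is how degradation at higher CoT steps can be made visible.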

Four Reasoning Categories

Taxonomy covering visual (scene understanding), spatial (layout/orientation), cultural (contextual/regional cues), and precise geolocation (exact location prediction)

Visual Locatability Score

A novel per-image metric quantifying how geographically identifiable an image is based on its visual content, used to stratify benchmark difficulty
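The paper summary does not give the score's formula. As a hedged sketch, one plausible variant weights the pixel fraction of geographically discriminative segmentation classes; the class names and weights below are invented for illustration only.

```python
# Illustrative sketch of a locatability-style metric: weight the per-class
# pixel fractions of an image's semantic segmentation by how geographically
# discriminative each class is. Weights and class names are assumptions,
# not the paper's actual formula.
DISCRIMINATIVE_WEIGHTS = {
    "signboard": 3.0,  # text and signage strongly pin down language/region
    "building": 1.0,   # architecture carries moderate regional signal
    "car": 0.5,
    "sky": 0.0,        # generic classes carry essentially no location signal
}

def locatability_score(class_fractions: dict[str, float]) -> float:
    """Weighted sum of per-class pixel fractions, clipped to [0, 1]."""
    raw = sum(frac * DISCRIMINATIVE_WEIGHTS.get(cls, 0.0)
              for cls, frac in class_fractions.items())
    return min(1.0, raw)
```

A score like this could stratify the benchmark: low-scoring images (mostly sky and road) should be hard for any model, while failures on high-scoring images point at weak visual grounding.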

Semantic Segmentation Annotations

150-class pixel-level segmentation labels enriching each image to enable analysis of scene composition versus reasoning performance

Difficulty Annotations

Per-question difficulty labels across the CoT steps enabling fine-grained diagnostics of where and why models fail
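Per-question labels like these enable grouped diagnostics instead of one aggregate accuracy number. A minimal sketch, assuming each evaluated question is reduced to a `(category, difficulty, correct)` tuple (this reduction is my assumption, not the paper's evaluation code):

```python
from collections import defaultdict

def stratified_accuracy(records):
    """records: iterable of (category, difficulty, correct) tuples.
    Returns accuracy per (category, difficulty) bucket, so failures can be
    localized (e.g. cultural questions at high difficulty) rather than
    averaged away in a single benchmark-wide score."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, difficulty, correct in records:
        key = (category, difficulty)
        totals[key] += 1
        hits[key] += int(correct)
    return {key: hits[key] / totals[key] for key in totals}
```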

Results

| Aspect | Prior MLLM Geo-Benchmarks | GeoChain Evaluation | Delta |
|---|---|---|---|
| Scale (images) | ~thousands | 1.46 million | Orders of magnitude larger |
| Q&A pairs | Tens of thousands | 30+ million | Massive scale increase |
| Reasoning structure | Flat single-turn QA | 21-step CoT sequences | Structured progressive evaluation |
| GPT-4.1 / Gemini 2.5 precise geolocation | Not systematically evaluated | Consistent failures at high complexity | Reveals systematic gap |
| Visual grounding accuracy (top models) | Assumed strong | Frequent weaknesses identified | Diagnostic improvement |

Key Takeaways

  • Even frontier models like GPT-4.1 and Gemini 2.5 Pro exhibit erratic reasoning and poor visual grounding when geographic reasoning is decomposed into progressive steps — practitioners should not assume strong CoT coherence in geo-tasks
  • The visual locatability score and semantic segmentation metadata make GeoChain useful not just as a benchmark but as a tool for curriculum learning or hard-negative mining in training geolocation-capable MLLMs
  • The four-category taxonomy (visual, spatial, cultural, precise geolocation) provides a practical diagnostic framework for ML teams building or evaluating location-aware multimodal systems, enabling targeted identification of failure modes rather than aggregate accuracy scores

Abstract

This paper introduces GeoChain, a large-scale benchmark for evaluating step-by-step geographic reasoning in multimodal large language models (MLLMs). Leveraging 1.46 million Mapillary street-level images, GeoChain pairs each image with a 21-step chain-of-thought (CoT) question sequence (over 30 million Q&A pairs). These sequences guide models from coarse attributes to fine-grained localization across four reasoning categories - visual, spatial, cultural, and precise geolocation - annotated by difficulty. Images are also enriched with semantic segmentation (150 classes) and a visual locatability score. Our benchmarking of contemporary MLLMs (GPT-4.1 variants, Claude 3.7, Gemini 2.5 variants) on a diverse 2,088-image subset reveals consistent challenges: models frequently exhibit weaknesses in visual grounding, display erratic reasoning, and struggle to achieve accurate localization, especially as the reasoning complexity escalates. GeoChain offers a robust diagnostic methodology, critical for fostering significant advancements in complex geographic reasoning within MLLMs.
