
FewMMBench: A Benchmark for Multimodal Few-Shot Learning

Mustafa Dogan, Ilker Kesen, Iacer Calixto, Aykut Erdem, Erkut Erdem
2026
FewMMBench is a comprehensive benchmark for systematically evaluating multimodal large language models under few-shot conditions, focusing on In-Context Learning and Chain-of-Thought prompting across diverse multimodal understanding tasks.

Problem Statement

Despite rapid advances in MLLMs, there is no rigorous standardized benchmark for evaluating their few-shot learning capabilities under interleaved image-text conditions. Existing evaluations largely focus on zero-shot performance, leaving open questions about how models respond to demonstrations and reasoning augmentation. This gap makes it difficult to diagnose whether instruction-tuned MLLMs truly benefit from in-context learning or chain-of-thought reasoning.

Key Novelty

  • Introduction of FewMMBench, the first benchmark specifically designed to evaluate MLLMs across zero-shot, few-shot, and CoT-augmented few-shot settings with systematic controls
  • Broad coverage of 26 open-weight MLLMs from six model families, enabling cross-model and cross-family comparative analysis on few-shot multimodal tasks
  • Diverse task suite spanning attribute recognition to temporal reasoning, combined with evaluation of retrieval-based demonstration selection and varying context sizes

Evaluation Highlights

  • Instruction-tuned models show strong zero-shot performance but exhibit minimal gains or even regression when provided with additional demonstrations or CoT reasoning prompts
  • Retrieval-based demonstration selection and increased context window sizes yield limited performance improvements across evaluated models, suggesting few-shot ICL remains an open challenge for MLLMs

Breakthrough Assessment

5/10. FewMMBench is a solid and timely contribution that fills a real gap in MLLM evaluation infrastructure, but it is primarily a benchmarking paper rather than a methodological advance, and its findings (that ICL and CoT help less than expected) are more diagnostic than transformative.

Methodology

  1. Curate a diverse benchmark covering multiple multimodal understanding task types (e.g., attribute recognition, temporal reasoning) with structured few-shot evaluation splits and ground-truth annotations
  2. Design evaluation protocols for zero-shot, k-shot ICL (with random and retrieval-based demonstration selection), and CoT-augmented few-shot prompting, varying context sizes systematically
  3. Evaluate 26 open-weight MLLMs from six model families across all protocols, analyzing performance trends by task type, model family, instruction-tuning status, and prompting strategy
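The prompt-construction step underlying these protocols can be sketched as follows. This is an illustrative assumption, not FewMMBench's actual format: the message schema and the field names ("image", "question", "answer", "rationale") are placeholders.

```python
def build_fewshot_prompt(demos, query, cot=False):
    """Assemble an interleaved image-text message list: k demonstration
    pairs followed by the query. With cot=True, demonstration answers
    include a worked rationale before the final answer."""
    messages = []
    for demo in demos:
        # Each demonstration is a user turn (image + question) followed
        # by an assistant turn (the gold answer).
        messages.append({"role": "user", "content": [
            {"type": "image", "image": demo["image"]},
            {"type": "text", "text": demo["question"]},
        ]})
        answer = (demo["rationale"] + " Answer: " + demo["answer"]
                  if cot else demo["answer"])
        messages.append({"role": "assistant", "content": [
            {"type": "text", "text": answer},
        ]})
    # The query itself is appended last, with no answer.
    messages.append({"role": "user", "content": [
        {"type": "image", "image": query["image"]},
        {"type": "text", "text": query["question"]},
    ]})
    return messages
```

Setting `demos=[]` recovers the zero-shot protocol, so all three settings share one code path and differ only in what is prepended to the query.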

System Components

FewMMBench Dataset

A curated multimodal benchmark with tasks ranging from attribute recognition to temporal reasoning, formatted to support interleaved image-text few-shot evaluation

ICL Evaluation Protocol

Systematic zero-shot and k-shot in-context learning evaluation with both random and retrieval-based demonstration selection strategies
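Retrieval-based selection typically ranks the candidate pool by embedding similarity to the query. A minimal sketch, assuming precomputed embeddings (e.g. from a CLIP-style encoder); the `emb` field is an illustrative placeholder, not the benchmark's actual schema:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve_demos(query_emb, pool, k):
    """Return the k pool examples most similar to the query embedding
    (the retrieval-based strategy; random selection would instead use
    random.sample(pool, k))."""
    ranked = sorted(pool, key=lambda ex: cosine(query_emb, ex["emb"]),
                    reverse=True)
    return ranked[:k]
```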

CoT-Augmented Prompting

Chain-of-Thought prompting integrated into few-shot settings to assess whether explicit reasoning steps improve MLLM performance on multimodal tasks
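In practice, CoT augmentation amounts to appending a reasoning trigger to the question and parsing a final answer out of the model's rationale. A hedged sketch using the common "Let's think step by step." trigger and an illustrative "Answer:" convention; neither is confirmed to be the paper's exact template or parser:

```python
import re

COT_TRIGGER = "Let's think step by step."

def cot_question(question: str) -> str:
    """Append the standard zero-shot-CoT trigger to a task question."""
    return question.rstrip() + " " + COT_TRIGGER

def extract_final_answer(response: str) -> str:
    """Pull the final answer from a CoT response, assuming the model
    ends its rationale with 'Answer: <label>'. Falls back to the raw
    response when no such marker is found."""
    match = re.search(r"Answer:\s*(.+)", response)
    return match.group(1).strip() if match else response.strip()
```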

Multi-Model Evaluation Suite

Standardized evaluation pipeline covering 26 open-weight MLLMs across six model families, enabling controlled cross-model comparison
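Such a pipeline reduces to a controlled grid over models, tasks, and prompting settings. A schematic sketch; `run_fn`, the task schema, and exact-match scoring are assumptions for illustration, not the paper's actual harness:

```python
def evaluate_suite(models, tasks, settings, run_fn):
    """Run every model on every task under every prompting setting,
    returning accuracy keyed by (model, task, setting). run_fn is a
    caller-supplied callable: run_fn(model, example, setting) -> str."""
    results = {}
    for model in models:
        for task in tasks:
            for setting in settings:
                preds = [run_fn(model, ex, setting)
                         for ex in task["examples"]]
                # Exact-match scoring against gold answers.
                correct = sum(p == ex["answer"]
                              for p, ex in zip(preds, task["examples"]))
                results[(model, task["name"], setting)] = (
                    correct / len(task["examples"]))
    return results
```

Keeping the grid explicit like this is what makes the cross-model and cross-setting comparisons controlled: every cell in the results table is produced by the same code path.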

Results

| Setting | Zero-Shot | Few-Shot ICL | CoT-Augmented Few-Shot |
| --- | --- | --- | --- |
| Instruction-tuned models | Strong baseline | Minimal gain or regression | Minimal gain or regression |
| Retrieval-based demos | N/A | Limited improvement over random | Limited improvement |
| Increased context size | N/A | Limited gains | Limited gains |

Key Takeaways

  • Practitioners should not assume that adding more demonstrations or CoT prompting will improve instruction-tuned MLLM performance — these models may already be saturated by instruction tuning and can regress with added context
  • Retrieval-based demonstration selection, while theoretically promising, provides only marginal benefits over random selection for current MLLMs, suggesting better demonstration selection strategies are needed
  • FewMMBench provides a ready-to-use diagnostic tool for benchmarking new MLLMs on few-shot multimodal tasks, making it valuable for model developers seeking to identify weaknesses in ICL capabilities

Abstract

As multimodal large language models (MLLMs) advance in handling interleaved image-text data, assessing their few-shot learning capabilities remains an open challenge. In this paper, we introduce FewMMBench, a comprehensive benchmark designed to evaluate MLLMs under few-shot conditions, with a focus on In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting. Covering a diverse suite of multimodal understanding tasks, from attribute recognition to temporal reasoning, FewMMBench enables systematic analysis across task types, model families, and prompting strategies. We evaluate 26 open-weight MLLMs from six model families across zero-shot, few-shot, and CoT-augmented few-shot settings. Our findings reveal that instruction-tuned models exhibit strong zero-shot performance but benefit minimally, or even regress, with additional demonstrations or CoT reasoning. Retrieval-based demonstrations and increased context size also yield limited gains. These results highlight FewMMBench as a rigorous testbed for diagnosing and advancing few-shot capabilities in multimodal LLMs. The data is available at: https://huggingface.co/datasets/mustafaa/FewMMBench

Generated on 2026-03-02 using Claude