
FewMMBench: A Benchmark for Multimodal Few-Shot Learning

Mustafa Dogan, Ilker Kesen, Iacer Calixto, Aykut Erdem, Erkut Erdem
2026
FewMMBench is a comprehensive benchmark for systematically evaluating multimodal large language models under few-shot conditions, focusing on In-Context Learning and Chain-of-Thought prompting across diverse multimodal understanding tasks.

Problem Statement

Despite rapid advances in MLLMs, there is no rigorous standardized benchmark for evaluating their few-shot learning capabilities under interleaved image-text conditions. Existing evaluations largely focus on zero-shot performance, leaving open questions about how models respond to demonstrations and reasoning augmentation. This gap makes it difficult to diagnose whether instruction-tuned MLLMs truly benefit from in-context learning or chain-of-thought reasoning.

Key Novelty

  • Introduction of FewMMBench, the first benchmark specifically designed to evaluate MLLMs across zero-shot, few-shot, and CoT-augmented few-shot settings with systematic controls
  • Broad coverage of 26 open-weight MLLMs from six model families, enabling cross-model and cross-family comparative analysis on few-shot multimodal tasks
  • Diverse task suite spanning attribute recognition to temporal reasoning, combined with evaluation of retrieval-based demonstration selection and varying context sizes

Evaluation Highlights

  • Instruction-tuned models show strong zero-shot performance but exhibit minimal gains or even regression when provided with additional demonstrations or CoT reasoning prompts
  • Retrieval-based demonstration selection and increased context window sizes yield limited performance improvements across evaluated models, suggesting few-shot ICL remains an open challenge for MLLMs

Breakthrough Assessment

5/10. FewMMBench is a solid and timely contribution that fills a real gap in MLLM evaluation infrastructure, but it is primarily a benchmarking paper rather than a methodological advance, and its findings (that ICL and CoT help less than expected) are more diagnostic than transformative.

Methodology

  1. Curate a diverse benchmark covering multiple multimodal understanding task types (e.g., attribute recognition, temporal reasoning) with structured few-shot evaluation splits and ground-truth annotations
  2. Design evaluation protocols for zero-shot, k-shot ICL (with random and retrieval-based demonstration selection), and CoT-augmented few-shot prompting, varying context sizes systematically
  3. Evaluate 26 open-weight MLLMs from six model families across all protocols, analyzing performance trends by task type, model family, instruction-tuning status, and prompting strategy
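The prompt-construction step underlying these protocols can be sketched as follows. This is an illustrative assumption, not FewMMBench's actual format: the message schema and the field names ("image", "question", "answer", "rationale") are placeholders.

```python
def build_fewshot_prompt(demos, query, cot=False):
    """Assemble an interleaved image-text message list: k demonstration
    pairs followed by the query. With cot=True, demonstration answers
    include a worked rationale before the final answer."""
    messages = []
    for demo in demos:
        # Each demonstration is a user turn (image + question) followed
        # by an assistant turn (the gold answer).
        messages.append({"role": "user", "content": [
            {"type": "image", "image": demo["image"]},
            {"type": "text", "text": demo["question"]},
        ]})
        answer = (demo["rationale"] + " Answer: " + demo["answer"]
                  if cot else demo["answer"])
        messages.append({"role": "assistant", "content": [
            {"type": "text", "text": answer},
        ]})
    # The query itself is appended last, with no answer.
    messages.append({"role": "user", "content": [
        {"type": "image", "image": query["image"]},
        {"type": "text", "text": query["question"]},
    ]})
    return messages
```

Setting `demos=[]` recovers the zero-shot protocol, so all three settings share one code path and differ only in what is prepended to the query.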

System Components

FewMMBench Dataset

A curated multimodal benchmark with tasks ranging from attribute recognition to temporal reasoning, formatted to support interleaved image-text few-shot evaluation

ICL Evaluation Protocol

Systematic zero-shot and k-shot in-context learning evaluation with both random and retrieval-based demonstration selection strategies
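Retrieval-based selection typically ranks the candidate pool by embedding similarity to the query. A minimal sketch, assuming precomputed embeddings (e.g. from a CLIP-style encoder); the `emb` field is an illustrative placeholder, not the benchmark's actual schema:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve_demos(query_emb, pool, k):
    """Return the k pool examples most similar to the query embedding
    (the retrieval-based strategy; random selection would instead use
    random.sample(pool, k))."""
    ranked = sorted(pool, key=lambda ex: cosine(query_emb, ex["emb"]),
                    reverse=True)
    return ranked[:k]
```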

CoT-Augmented Prompting

Chain-of-Thought prompting integrated into few-shot settings to assess whether explicit reasoning steps improve MLLM performance on multimodal tasks
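In practice, CoT augmentation amounts to appending a reasoning trigger to the question and parsing a final answer out of the model's rationale. A hedged sketch using the common "Let's think step by step." trigger and an illustrative "Answer:" convention; neither is confirmed to be the paper's exact template or parser:

```python
import re

COT_TRIGGER = "Let's think step by step."

def cot_question(question: str) -> str:
    """Append the standard zero-shot-CoT trigger to a task question."""
    return question.rstrip() + " " + COT_TRIGGER

def extract_final_answer(response: str) -> str:
    """Pull the final answer from a CoT response, assuming the model
    ends its rationale with 'Answer: <label>'. Falls back to the raw
    response when no such marker is found."""
    match = re.search(r"Answer:\s*(.+)", response)
    return match.group(1).strip() if match else response.strip()
```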

Multi-Model Evaluation Suite

Standardized evaluation pipeline covering 26 open-weight MLLMs across six model families, enabling controlled cross-model comparison
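Such a pipeline reduces to a controlled grid over models, tasks, and prompting settings. A schematic sketch; `run_fn`, the task schema, and exact-match scoring are assumptions for illustration, not the paper's actual harness:

```python
def evaluate_suite(models, tasks, settings, run_fn):
    """Run every model on every task under every prompting setting,
    returning accuracy keyed by (model, task, setting). run_fn is a
    caller-supplied callable: run_fn(model, example, setting) -> str."""
    results = {}
    for model in models:
        for task in tasks:
            for setting in settings:
                preds = [run_fn(model, ex, setting)
                         for ex in task["examples"]]
                # Exact-match scoring against gold answers.
                correct = sum(p == ex["answer"]
                              for p, ex in zip(preds, task["examples"]))
                results[(model, task["name"], setting)] = (
                    correct / len(task["examples"]))
    return results
```

Keeping the grid explicit like this is what makes the cross-model and cross-setting comparisons controlled: every cell in the results table is produced by the same code path.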

Results

| Setting | Zero-Shot | Few-Shot ICL | CoT-Augmented Few-Shot |
| --- | --- | --- | --- |
| Instruction-tuned models | Strong baseline | Minimal gain or regression | Minimal gain or regression |
| Retrieval-based demos | N/A | Limited improvement over random | Limited improvement |
| Increased context size | N/A | Limited gains | Limited gains |

Key Takeaways

  • Practitioners should not assume that adding more demonstrations or CoT prompting will improve instruction-tuned MLLM performance — these models may already be saturated by instruction tuning and can regress with added context
  • Retrieval-based demonstration selection, while theoretically promising, provides only marginal benefits over random selection for current MLLMs, suggesting better demonstration selection strategies are needed
  • FewMMBench provides a ready-to-use diagnostic tool for benchmarking new MLLMs on few-shot multimodal tasks, making it valuable for model developers seeking to identify weaknesses in ICL capabilities

Abstract

As multimodal large language models (MLLMs) advance in handling interleaved image-text data, assessing their few-shot learning capabilities remains an open challenge. In this paper, we introduce FewMMBench, a comprehensive benchmark designed to evaluate MLLMs under few-shot conditions, with a focus on In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting. Covering a diverse suite of multimodal understanding tasks, from attribute recognition to temporal reasoning, FewMMBench enables systematic analysis across task types, model families, and prompting strategies. We evaluate 26 open-weight MLLMs from six model families across zero-shot, few-shot, and CoT-augmented few-shot settings. Our findings reveal that instruction-tuned models exhibit strong zero-shot performance but benefit minimally, or even regress, with additional demonstrations or CoT reasoning. Retrieval-based demonstrations and increased context size also yield limited gains. These results highlight FewMMBench as a rigorous testbed for diagnosing and advancing few-shot capabilities in multimodal LLMs. The data is available at: https://huggingface.co/datasets/mustafaa/FewMMBench

Generated on 2026-03-02 using Claude