FewMMBench: A Benchmark for Multimodal Few-Shot Learning
Problem Statement
Despite rapid advances in MLLMs, there is no rigorous, standardized benchmark for evaluating their few-shot learning capabilities under interleaved image-text conditions. Existing evaluations largely focus on zero-shot performance, leaving open questions about how models respond to demonstrations and reasoning augmentation. This gap makes it difficult to diagnose whether instruction-tuned MLLMs truly benefit from in-context learning or chain-of-thought reasoning.
Key Novelty
- Introduction of FewMMBench, the first benchmark specifically designed to evaluate MLLMs across zero-shot, few-shot, and CoT-augmented few-shot settings with systematic controls
- Broad coverage of 26 open-weight MLLMs from six model families, enabling cross-model and cross-family comparative analysis on few-shot multimodal tasks
- Diverse task suite spanning tasks from attribute recognition to temporal reasoning, combined with evaluation of retrieval-based demonstration selection and varying context sizes
Evaluation Highlights
- Instruction-tuned models show strong zero-shot performance but exhibit minimal gains or even regression when provided with additional demonstrations or CoT reasoning prompts
- Retrieval-based demonstration selection and increased context window sizes yield limited performance improvements across evaluated models, suggesting few-shot ICL remains an open challenge for MLLMs
Methodology
- Curate a diverse benchmark covering multiple multimodal understanding task types (e.g., attribute recognition, temporal reasoning) with structured few-shot evaluation splits and ground-truth annotations
- Design evaluation protocols for zero-shot, k-shot ICL (with random and retrieval-based demonstration selection), and CoT-augmented few-shot prompting, varying context sizes systematically
- Evaluate 26 open-weight MLLMs from six model families across all protocols, analyzing performance trends by task type, model family, instruction-tuning status, and prompting strategy
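The demonstration-selection step of the k-shot protocol can be sketched as follows. This is an illustrative implementation, not the benchmark's actual code: the function name `select_demonstrations` and the use of cosine similarity over precomputed embeddings (e.g. from a CLIP-style encoder) are assumptions for the retrieval-based strategy.

```python
import numpy as np

def select_demonstrations(query_emb, pool_embs, k, strategy="retrieval", rng=None):
    """Pick k demonstration indices from a candidate pool.

    query_emb: (d,) embedding of the test query
    pool_embs: (n, d) embeddings of candidate demonstrations
    strategy:  "random" or "retrieval" (nearest neighbours by cosine similarity)
    """
    n = pool_embs.shape[0]
    if strategy == "random":
        # Uniform sampling without replacement, as in the random baseline.
        rng = rng or np.random.default_rng(0)
        return rng.choice(n, size=k, replace=False).tolist()
    # Retrieval-based: rank the pool by cosine similarity to the query.
    q = query_emb / np.linalg.norm(query_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = p @ q
    return np.argsort(-sims)[:k].tolist()
```

Varying `k` and the strategy argument then reproduces the random-vs-retrieval comparison described above.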
System Components
A curated multimodal benchmark with tasks ranging from attribute recognition to temporal reasoning, formatted to support interleaved image-text few-shot evaluation
Systematic zero-shot and k-shot in-context learning evaluation with both random and retrieval-based demonstration selection strategies
Chain-of-Thought prompting integrated into few-shot settings to assess whether explicit reasoning steps improve MLLM performance on multimodal tasks
Standardized evaluation pipeline covering 26 open-weight MLLMs across six model families, enabling controlled cross-model comparison
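The interleaved image-text few-shot format described above can be assembled into a chat-style prompt roughly as follows. The message schema (`role`/`content` dicts with `"type": "image"` and `"type": "text"` parts) and the function name `build_fewshot_prompt` are placeholders and would need adapting to each model family's API; the CoT branch simply prepends a rationale to each demonstration answer.

```python
def build_fewshot_prompt(demos, query, cot=False):
    """Assemble an interleaved image-text few-shot prompt as a message list.

    demos: list of dicts with keys "image", "question", "answer", and
           optionally "rationale" (used when cot=True).
    query: dict with keys "image" and "question".
    """
    messages = []
    for d in demos:
        # Each demonstration is a user turn (image + question) ...
        messages.append({"role": "user", "content": [
            {"type": "image", "image": d["image"]},
            {"type": "text", "text": d["question"]},
        ]})
        answer = d["answer"]
        if cot and "rationale" in d:
            # ... optionally with the reasoning chain before the final answer.
            answer = f"{d['rationale']} Therefore, the answer is {d['answer']}."
        messages.append({"role": "assistant",
                         "content": [{"type": "text", "text": answer}]})
    # The test query comes last, with no answer.
    messages.append({"role": "user", "content": [
        {"type": "image", "image": query["image"]},
        {"type": "text", "text": query["question"]},
    ]})
    return messages
```

Setting `demos=[]` recovers the zero-shot protocol, so one builder covers all three evaluation settings.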
Results
| Setting | Zero-Shot | Few-Shot ICL | CoT-Augmented Few-Shot |
|---|---|---|---|
| Instruction-tuned models | Strong baseline | Minimal gain or regression | Minimal gain or regression |
| Retrieval-based demos | N/A | Limited improvement over random | Limited improvement |
| Increased context size | N/A | Limited gains | Limited gains |
Key Takeaways
- Practitioners should not assume that adding more demonstrations or CoT prompting will improve instruction-tuned MLLM performance; these models may already be saturated by instruction tuning and can regress with added context
- Retrieval-based demonstration selection, while theoretically promising, provides only marginal benefits over random selection for current MLLMs, suggesting better demonstration selection strategies are needed
- FewMMBench provides a ready-to-use diagnostic tool for benchmarking new MLLMs on few-shot multimodal tasks, making it valuable for model developers seeking to identify weaknesses in ICL capabilities
Abstract
As multimodal large language models (MLLMs) advance in handling interleaved image-text data, assessing their few-shot learning capabilities remains an open challenge. In this paper, we introduce FewMMBench, a comprehensive benchmark designed to evaluate MLLMs under few-shot conditions, with a focus on In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting. Covering a diverse suite of multimodal understanding tasks, from attribute recognition to temporal reasoning, FewMMBench enables systematic analysis across task types, model families, and prompting strategies. We evaluate 26 open-weight MLLMs from six model families across zero-shot, few-shot, and CoT-augmented few-shot settings. Our findings reveal that instruction-tuned models exhibit strong zero-shot performance but benefit minimally, or even regress, with additional demonstrations or CoT reasoning. Retrieval-based demonstrations and increased context size also yield limited gains. These results highlight FewMMBench as a rigorous testbed for diagnosing and advancing few-shot capabilities in multimodal LLMs. The data is available at: https://huggingface.co/datasets/mustafaa/FewMMBench