
MiMo-Audio: Audio Language Models are Few-Shot Learners

X. Zhang, Gang Wang, Jinlong Xue, Kai Fang, Liang Zhao, Rui Ma, Shu-Qin Ren, Shuo Liu, Tao Guo, Weiji Zhuang, Xin Zhang, Xi-Na Song, Yihan Yan, Yongzhe He, Cici, Bowen Shen, Chengxuan Zhu, Chong Ma, Chun Chen, Heyu Chen, Jiawei Li, Lei Li, Menghang Zhu, Peidian Li, Qi-ying Wang, Sirui Deng, Weimin Xiong, Wen Huang, Wenyu Yang, Yilin Jiang, Yixin Yang, Yu-Shi Tian, Yue Ma, Yue Yu, Zihan Zhang, Zihao Yue, Bangjun Xiao, Bin Xia, Bofei Gao, Bowen Ye, Can Cai, Chang Liu, Chenhong He, Chunan Li, Dawei Zhu, Duo Zhang, Fengyuan Shi, Guoan Wang, Hailin Zhang, Hanglong Lv, Hanyu Li, Hao Tian, Hengxu Qu, Hong-Mei Xu, Houbin Zhang, Huaqiu Liu, Jiangshan Duo, Jia Zuo, Jianyu Wei, Jiebao Xiao, Jinhao Dong, Jun Shi, Junhao Hu, Kainan Bao, Kang Zhou, Linghao Zhang, Meng Chen, Nuo Chen, Peng Zhang, Qian Chen, Qiantong Wang, Rang Li, Shao-yang Liu, Shengfan Wang, Shicheng Li, Shi-liang Yu, Shijie Cao, Shimao Chen, Shuhao Gu, Weikun Wang, Wen-Juan Ma, Xia Deng, Xing Yong, Xing Zhang, Xu Wang, Yi-Hao Song, Yihao Zhao, Yingbo Zhao, Yizhao Gao, Yu Cheng, Yuanfang Tu, Yudong Wang, Zhaojun Huang, Zheng-Yu Tang, Zhenrui Lin, Zhichao Song, Zhi-Yue Xu, Zhixian Zheng, Zi-Cheng Jiang
arXiv.org | 2025
MiMo-Audio demonstrates that scaling next-token prediction pretraining on over 100 million hours of audio data enables few-shot generalization across diverse audio tasks, mirroring GPT-3's paradigm shift in NLP but applied to the audio domain.

Problem Statement

Current audio language models require task-specific fine-tuning for each application, making them brittle and unable to generalize to novel tasks without additional training data and compute. This contrasts sharply with human auditory cognition, which generalizes from just a few examples. The field lacks a unified audio foundation model capable of emergent few-shot learning across speech intelligence, audio understanding, and generation tasks simultaneously.

Key Novelty

  • Scaling audio language model pretraining to over 100 million hours of data to elicit emergent few-shot learning capabilities across a wide spectrum of audio tasks without task-specific fine-tuning
  • Demonstration of zero/few-shot generalization to tasks absent from training data, including voice conversion, style transfer, speech editing, and realistic speech continuation (talk shows, debates, livestreaming)
  • Integration of explicit thinking/reasoning mechanisms into both audio understanding and generation at the post-training (instruction-tuning) stage, yielding MiMo-Audio-7B-Instruct that rivals closed-source models
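The few-shot capability above relies on nothing more than next-token prediction over interleaved examples. The sketch below illustrates the general idea of assembling an in-context prompt from audio-token example pairs (e.g. for voice conversion); the separator tokens, token IDs, and function interface are all hypothetical, not the paper's actual format.

```python
# Illustrative sketch of few-shot in-context prompting for an audio LM.
# All token IDs and the prompt layout are assumptions -- the paper
# describes the capability, not this exact wiring.

SEP = 0          # hypothetical separator between source and target audio
EXAMPLE_END = 1  # hypothetical token marking the end of one example pair

def build_few_shot_prompt(examples, query_tokens):
    """Interleave (source, target) audio-token pairs, then append the query.

    A model trained only on next-token prediction is expected to continue
    the sequence with the transformed audio for `query_tokens`.
    """
    prompt = []
    for src, tgt in examples:
        prompt.extend(src)
        prompt.append(SEP)
        prompt.extend(tgt)
        prompt.append(EXAMPLE_END)
    prompt.extend(query_tokens)
    prompt.append(SEP)
    return prompt

# Toy "audio tokens" standing in for discretized speech codes.
examples = [([10, 11, 12], [20, 21, 22]),
            ([13, 14], [23, 24])]
prompt = build_few_shot_prompt(examples, [15, 16, 17])
print(len(prompt))  # 8 + 6 + 4 = 18 tokens
```

The point of the sketch is that no task-specific head or fine-tuning step appears anywhere: the task is specified entirely by the example pairs in the context window.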

Evaluation Highlights

  • MiMo-Audio-7B-Base achieves open-source SOTA on speech intelligence and audio understanding benchmarks, generalizing to unseen tasks like voice conversion and style transfer without fine-tuning
  • MiMo-Audio-7B-Instruct achieves open-source SOTA on audio understanding benchmarks (MMSU, MMAU, MMAR, MMAU-Pro), spoken dialogue benchmarks (Big Bench Audio, MultiChallenge Audio), and instruct-TTS evaluations, approaching or surpassing closed-source models

Breakthrough Assessment

8/10. Applying the GPT-3 scaling paradigm to audio with 100M+ hours of pretraining data to achieve emergent few-shot audio learning is a significant empirical advance, demonstrating cross-task generalization previously unseen in open-source audio models. It falls short of a full paradigm shift, however, as it largely adapts an established NLP recipe rather than introducing fundamentally new learning principles.

Methodology

  1. Pretrain a 7B-parameter audio language model (MiMo-Audio-7B-Base) using next-token prediction on over 100 million hours of diverse audio data, enabling emergent few-shot learning across speech and audio tasks
  2. Conduct systematic evaluation of few-shot generalization capabilities on both standard benchmarks and held-out tasks (voice conversion, style transfer, speech editing, speech continuation) to characterize emergent abilities
  3. Perform post-training via a curated diverse instruction-tuning corpus and introduce chain-of-thought / thinking mechanisms for both audio understanding and generation to produce MiMo-Audio-7B-Instruct
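Step 1 above is the standard language-modeling objective applied to discretized audio. A minimal sketch of the data preparation, assuming audio has already been turned into codec tokens by some (unspecified) neural tokenizer:

```python
# Minimal sketch of the next-token-prediction setup from step 1.
# The audio tokenizer itself is out of scope here; real systems discretize
# waveforms with a neural codec before language-model pretraining.

def make_training_pair(token_seq):
    """Shift by one: the model predicts token t+1 from tokens up to t."""
    inputs = token_seq[:-1]
    targets = token_seq[1:]
    return inputs, targets

# A toy "utterance" of codec tokens.
utterance = [5, 9, 9, 3, 7, 2]
inputs, targets = make_training_pair(utterance)
print(inputs)   # [5, 9, 9, 3, 7]
print(targets)  # [9, 9, 3, 7, 2]
```

Because the objective is identical to text language modeling, scaling laws and emergent in-context abilities observed in NLP transfer naturally, which is the paper's central bet.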

System Components

MiMo-Audio-7B-Base

Large-scale pretrained audio language model trained on 100M+ hours of audio using next-token prediction, serving as the few-shot generalist foundation

Massive Audio Pretraining Corpus

Over one hundred million hours of diverse audio data covering speech, music, environmental sounds, and conversational content used for scalable pretraining

Systematic Few-Shot Evaluation Suite

A comprehensive benchmark suite assessing few-shot generalization across standard (MMAU, MMSU, MMAR, MMAU-Pro) and novel held-out audio tasks

Thinking Mechanism

Chain-of-thought style reasoning integrated at post-training stage into both audio understanding and generation pipelines to improve complex task performance
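One plausible realization of such a mechanism is to have the model emit an explicit reasoning segment before its final answer or audio output, which is then stripped at inference time. The tag format below is an assumption for illustration; the paper states only that thinking is introduced at the instruction-tuning stage.

```python
# Hypothetical sketch of a "thinking" output format: reasoning between
# <think> tags, followed by the final answer. The tag names are an
# assumption, not the paper's actual delimiters.

import re

def split_thinking(output: str):
    """Separate the reasoning segment from the final answer."""
    match = re.search(r"<think>(.*?)</think>\s*(.*)", output, re.DOTALL)
    if match:
        return match.group(1).strip(), match.group(2).strip()
    return "", output.strip()

raw = ("<think>The speaker's pitch drops and pace slows; "
       "the question asks about tone.</think> The tone is hesitant.")
thought, answer = split_thinking(raw)
print(answer)  # The tone is hesitant.
```

Training on such traces lets the model spend tokens on intermediate analysis of the audio before committing to an answer, analogous to chain-of-thought prompting in text LLMs.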

MiMo-Audio-7B-Instruct

Instruction-tuned variant of the base model using a diverse curated corpus, achieving SOTA on spoken dialogue, audio QA, and instruct-TTS benchmarks

Results

| Benchmark | Prior Open-Source SOTA | MiMo-Audio-7B | Delta |
|---|---|---|---|
| MMAU (audio understanding) | Previous open-source best | New open-source SOTA | SOTA improvement |
| MMSU (speech understanding) | Previous open-source best | New open-source SOTA | SOTA improvement |
| MMAR | Previous open-source best | New open-source SOTA | SOTA improvement |
| MMAU-Pro | Previous open-source best | New open-source SOTA | SOTA improvement |
| Big Bench Audio (dialogue) | Previous open-source best | New open-source SOTA | SOTA improvement |
| Instruct-TTS evaluation | Previous open-source best | Approaches/surpasses closed-source | Significant improvement |
| Zero-shot voice conversion | Not supported (requires fine-tuning) | Supported without fine-tuning | Emergent capability |

Key Takeaways

  • Scaling audio pretraining data to 100M+ hours is sufficient to elicit GPT-3-like emergent few-shot generalization in audio models, suggesting practitioners should prioritize data scale over task-specific architectures when building general-purpose audio systems
  • Integrating chain-of-thought reasoning (thinking mechanisms) at the instruction-tuning stage meaningfully improves both audio understanding and generation quality, making it a worthwhile post-training technique for audio LLMs
  • Open-source models can now approach or match closed-source audio model performance; MiMo-Audio's released checkpoints and evaluation suite provide a strong baseline and benchmarking infrastructure for the community

Abstract

Existing audio language models typically rely on task-specific fine-tuning to accomplish particular audio tasks. In contrast, humans are able to generalize to new audio tasks with only a few examples or simple instructions. GPT-3 has shown that scaling next-token prediction pretraining enables strong generalization capabilities in text, and we believe this paradigm is equally applicable to the audio domain. By scaling MiMo-Audio's pretraining data to over one hundred million hours, we observe the emergence of few-shot learning capabilities across a diverse set of audio tasks. We develop a systematic evaluation of these capabilities and find that MiMo-Audio-7B-Base achieves SOTA performance on both speech intelligence and audio understanding benchmarks among open-source models. Beyond standard metrics, MiMo-Audio-7B-Base generalizes to tasks absent from its training data, such as voice conversion, style transfer, and speech editing. MiMo-Audio-7B-Base also demonstrates powerful speech continuation capabilities, capable of generating highly realistic talk shows, recitations, livestreaming, and debates. At the post-training stage, we curate a diverse instruction-tuning corpus and introduce thinking mechanisms into both audio understanding and generation. MiMo-Audio-7B-Instruct achieves open-source SOTA on audio understanding benchmarks (MMSU, MMAU, MMAR, MMAU-Pro), spoken dialogue benchmarks (Big Bench Audio, MultiChallenge Audio), and instruct-TTS evaluations, approaching or surpassing closed-source models. Model checkpoints and the full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-Audio.
