MiMo-Audio: Audio Language Models are Few-Shot Learners
Problem Statement
Current audio language models require task-specific fine-tuning for each application, making them brittle and unable to generalize to novel tasks without additional training data and compute. This contrasts sharply with human auditory cognition, which generalizes from just a few examples. The field lacks a unified audio foundation model capable of emergent few-shot learning across speech intelligence, audio understanding, and generation tasks simultaneously.
Key Novelty
- Scaling audio language model pretraining to over 100 million hours of data to elicit emergent few-shot learning capabilities across a wide spectrum of audio tasks without task-specific fine-tuning
- Demonstration of zero/few-shot generalization to tasks absent from training data, including voice conversion, style transfer, speech editing, and realistic speech continuation (talk shows, debates, livestreaming); a generic continuation-sampling sketch follows this list
- Integration of explicit thinking/reasoning mechanisms into both audio understanding and generation at the post-training (instruction-tuning) stage, yielding MiMo-Audio-7B-Instruct that rivals closed-source models
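To make the speech-continuation capability concrete, the sketch below shows a generic autoregressive sampling loop over discrete audio tokens. The `DummyLM`, vocabulary size, and sampling hyperparameters are hypothetical stand-ins, not MiMo-Audio's actual decoding pipeline; any causal LM that returns per-position logits could be dropped in.

```python
# Generic autoregressive continuation sketch (illustrative only; MiMo-Audio's
# real tokenizer, vocabulary, and detokenizer/vocoder are not shown here).
import torch

@torch.no_grad()
def continue_audio(model, prompt_tokens: torch.Tensor,
                   max_new_tokens: int = 64, temperature: float = 0.8) -> torch.Tensor:
    """Sample new audio tokens one at a time, feeding each back into the context."""
    tokens = prompt_tokens.clone()                     # shape: (1, prompt_len)
    for _ in range(max_new_tokens):
        logits = model(tokens)[:, -1, :]               # next-token distribution
        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens                                      # feed to an audio detokenizer

class DummyLM(torch.nn.Module):
    """Stand-in model emitting random logits; replace with a real audio LM."""
    def __init__(self, vocab_size: int = 4096):
        super().__init__()
        self.vocab_size = vocab_size

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return torch.randn(tokens.size(0), tokens.size(1), self.vocab_size)

prompt = torch.randint(0, 4096, (1, 32))   # placeholder for tokenized prompt audio
continuation = continue_audio(DummyLM(), prompt, max_new_tokens=16)
print(continuation.shape)                   # torch.Size([1, 48])
```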
Evaluation Highlights
- MiMo-Audio-7B-Base achieves open-source SOTA on speech intelligence and audio understanding benchmarks, generalizing to unseen tasks like voice conversion and style transfer without fine-tuning
- MiMo-Audio-7B-Instruct achieves open-source SOTA on audio understanding benchmarks (MMSU, MMAU, MMAR, MMAU-Pro), spoken dialogue benchmarks (Big Bench Audio, MultiChallenge Audio), and instruct-TTS evaluations, approaching or surpassing closed-source models
Breakthrough Assessment
Methodology
- Pretrain a 7B-parameter audio language model (MiMo-Audio-7B-Base) using next-token prediction on over 100 million hours of diverse audio data, enabling emergent few-shot learning across speech and audio tasks (see the training-objective sketch after this list)
- Conduct systematic evaluation of few-shot generalization capabilities on both standard benchmarks and held-out tasks (voice conversion, style transfer, speech editing, speech continuation) to characterize emergent abilities
- Perform post-training via a curated diverse instruction-tuning corpus and introduce chain-of-thought / thinking mechanisms for both audio understanding and generation to produce MiMo-Audio-7B-Instruct
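As a rough illustration of the pretraining objective described above, the sketch below trains a toy decoder-only model with next-token prediction over a joint token vocabulary; the architecture, vocabulary size, and optimizer settings are placeholders and do not reflect MiMo-Audio's actual 7B configuration or tokenizer.

```python
# Minimal next-token-prediction pretraining sketch (illustrative only; the real
# MiMo-Audio architecture, audio tokenizer, and hyperparameters differ).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCausalLM(nn.Module):
    """Toy decoder-only LM over a joint text+audio token vocabulary."""
    def __init__(self, vocab_size: int, d_model: int = 256, n_layers: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        causal_mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        hidden = self.backbone(self.embed(tokens), mask=causal_mask)
        return self.head(hidden)

vocab_size = 4096                                   # hypothetical joint vocabulary
model = TinyCausalLM(vocab_size)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# One training step: predict token t+1 from tokens <= t over sequences of
# discretized audio (random ids stand in for tokenizer output here).
tokens = torch.randint(0, vocab_size, (2, 128))
logits = model(tokens[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
loss.backward()
optimizer.step()
print(f"step loss: {loss.item():.3f}")
```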
System Components
- MiMo-Audio-7B-Base: a large-scale audio language model pretrained on 100M+ hours of audio with next-token prediction, serving as the few-shot generalist foundation
- Pretraining corpus: over one hundred million hours of diverse audio data covering speech, music, environmental sounds, and conversational content, used for scalable pretraining
- Evaluation suite: a comprehensive benchmark suite assessing few-shot generalization across standard benchmarks (MMAU, MMSU, MMAR, MMAU-Pro) and novel held-out audio tasks; a hypothetical prompt-assembly sketch follows this list
- Thinking mechanism: chain-of-thought style reasoning integrated at the post-training stage into both the audio understanding and generation pipelines to improve performance on complex tasks
- MiMo-Audio-7B-Instruct: an instruction-tuned variant of the base model trained on a diverse curated corpus, achieving open-source SOTA on spoken dialogue, audio QA, and instruct-TTS benchmarks
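To illustrate how a held-out task such as voice conversion could be posed purely in context, here is a minimal sketch of few-shot prompt assembly: (source, target) audio pairs are concatenated, followed by the query source, and the model is asked to continue with the converted audio. The marker tokens, file names, and `tokenize_audio` helper are hypothetical placeholders, not MiMo-Audio's actual prompt format.

```python
# Hypothetical few-shot prompt assembly for an audio LM (illustrative only;
# MiMo-Audio's real special tokens and interleaving scheme may differ).
from typing import List, Tuple

BOS, SRC, TGT = "<bos>", "<src>", "<tgt>"             # hypothetical marker tokens

def tokenize_audio(path: str) -> List[str]:
    """Stand-in for an audio tokenizer mapping a waveform to discrete tokens."""
    return [f"<audio:{path}:{i}>" for i in range(4)]  # placeholder token ids

def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> List[str]:
    """Concatenate (source, target) demonstration pairs, then the query source;
    the model is expected to continue the sequence with the converted audio."""
    prompt = [BOS]
    for src_path, tgt_path in examples:
        prompt += [SRC] + tokenize_audio(src_path) + [TGT] + tokenize_audio(tgt_path)
    prompt += [SRC] + tokenize_audio(query) + [TGT]
    return prompt

demo = build_few_shot_prompt(
    examples=[("spk_a_1.wav", "spk_b_1.wav"), ("spk_a_2.wav", "spk_b_2.wav")],
    query="spk_a_3.wav",
)
print(len(demo), demo[:3])
```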
Results
| Benchmark | Prior Open-Source SOTA | MiMo-Audio-7B | Delta |
|---|---|---|---|
| MMAU (Audio Understanding) | Previous open-source best | New open-source SOTA | SOTA improvement |
| MMSU (Speech Understanding) | Previous open-source best | New open-source SOTA | SOTA improvement |
| MMAR | Previous open-source best | New open-source SOTA | SOTA improvement |
| MMAU-Pro | Previous open-source best | New open-source SOTA | SOTA improvement |
| Big Bench Audio (Dialogue) | Previous open-source best | New open-source SOTA | SOTA improvement |
| Instruct-TTS Evaluation | Previous open-source best | Approaches/surpasses closed-source | Significant improvement |
| Zero-shot Voice Conversion | Not supported (requires fine-tuning) | Supported without fine-tuning | Emergent capability |
Key Takeaways
- Scaling audio pretraining data to 100M+ hours is sufficient to elicit GPT-3-like emergent few-shot generalization in audio models, suggesting practitioners should prioritize data scale over task-specific architectures when building general-purpose audio systems
- Integrating chain-of-thought reasoning (thinking mechanisms) at the instruction-tuning stage meaningfully improves both audio understanding and generation quality, making it a worthwhile post-training technique for audio LLMs; a hypothetical data-format sketch follows this list
- Open-source models can now approach or match closed-source audio model performance; MiMo-Audio's released checkpoints and evaluation suite provide a strong baseline and benchmarking infrastructure for the community
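To illustrate the kind of thinking supervision this takeaway refers to, the sketch below serializes one instruction-tuning example whose target contains an explicit reasoning span before the final answer. The `<think>` delimiters, field names, and sample content are hypothetical placeholders, not MiMo-Audio's actual data format.

```python
# Hypothetical instruction-tuning sample with an explicit "thinking" span
# (illustrative only; MiMo-Audio's real schema and delimiters may differ).
import json

def format_thinking_sample(question: str, audio_ref: str,
                           thought: str, answer: str) -> dict:
    """Build one supervised example whose target contains a reasoning span
    followed by the final answer, so the model learns to think, then answer."""
    return {
        "prompt": f"<audio>{audio_ref}</audio>\n{question}",
        "target": f"<think>{thought}</think>\n{answer}",
    }

sample = format_thinking_sample(
    question="Which instrument carries the melody in this clip?",
    audio_ref="clip_001.wav",
    thought="The lead line is bright and bowed with sustained vibrato, "
            "which points to a violin rather than a flute or trumpet.",
    answer="A violin carries the melody.",
)
print(json.dumps(sample, indent=2))
```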
Abstract
Existing audio language models typically rely on task-specific fine-tuning to accomplish particular audio tasks. In contrast, humans are able to generalize to new audio tasks with only a few examples or simple instructions. GPT-3 has shown that scaling next-token prediction pretraining enables strong generalization capabilities in text, and we believe this paradigm is equally applicable to the audio domain. By scaling MiMo-Audio's pretraining data to over one hundred million hours, we observe the emergence of few-shot learning capabilities across a diverse set of audio tasks. We develop a systematic evaluation of these capabilities and find that MiMo-Audio-7B-Base achieves SOTA performance on both speech intelligence and audio understanding benchmarks among open-source models. Beyond standard metrics, MiMo-Audio-7B-Base generalizes to tasks absent from its training data, such as voice conversion, style transfer, and speech editing. MiMo-Audio-7B-Base also demonstrates powerful speech continuation capabilities, generating highly realistic talk shows, recitations, livestreams, and debates. At the post-training stage, we curate a diverse instruction-tuning corpus and introduce thinking mechanisms into both audio understanding and generation. MiMo-Audio-7B-Instruct achieves open-source SOTA on audio understanding benchmarks (MMSU, MMAU, MMAR, MMAU-Pro), spoken dialogue benchmarks (Big Bench Audio, MultiChallenge Audio), and instruct-TTS evaluations, approaching or surpassing closed-source models. Model checkpoints and the full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-Audio.