SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing
Problem Statement
Existing open-source MLLM frameworks predominantly treat vision as the primary modality, offering limited architectural support, training recipes, and tooling for speech, audio, and music tasks. This forces audio-language researchers to build infrastructure from scratch and invest significant effort in code development and hyperparameter tuning. The absence of a dedicated audio-focused MLLM framework slows down research iteration and reproducibility in the audio-language community.
Key Novelty
- Modular architecture designed for audio modalities, with interchangeable encoders, projectors, LLM backbones, and parameter-efficient fine-tuning (PEFT) plugins (see the sketch after this list)
- Comprehensive training and inference recipes for mainstream audio-language tasks including ASR, Automated Audio Captioning (AAC), and Music Captioning (MC), with some achieving state-of-the-art results
- First dedicated open-source MLLM framework treating speech, audio, and music as first-class modalities rather than extensions of a vision-centric system
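To make the design pattern concrete, here is a minimal PyTorch sketch of the encoder-projector-LLM composition. The class and argument names (e.g., AudioLLM) are illustrative assumptions for exposition, not SLAM-LLM's actual API.

```python
# Minimal sketch of the encoder-projector-LLM pattern; names are illustrative,
# not SLAM-LLM's actual API. The pretrained encoder and LLM typically stay
# frozen; only the projector (and any PEFT adapters) receive gradient updates.
import torch
import torch.nn as nn

class AudioLLM(nn.Module):
    def __init__(self, encoder: nn.Module, projector: nn.Module, llm: nn.Module):
        super().__init__()
        self.encoder = encoder      # e.g., a Whisper encoder (frozen)
        self.projector = projector  # maps encoder features to the LLM space
        self.llm = llm              # e.g., a LLaMA/Qwen backbone

    def forward(self, audio: torch.Tensor, text_embeds: torch.Tensor):
        with torch.no_grad():                 # keep the pretrained encoder frozen
            feats = self.encoder(audio)       # (batch, frames, enc_dim)
        audio_embeds = self.projector(feats)  # (batch, frames', llm_dim)
        # Prepend projected audio tokens to the text prompt embeddings
        # (HF-style call that accepts precomputed input embeddings).
        inputs_embeds = torch.cat([audio_embeds, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```

Freezing the encoder and LLM keeps training lightweight, since only the small projector bridges the two pretrained spaces.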
Evaluation Highlights
- LLM-based Automatic Speech Recognition (ASR) checkpoints reach or approach state-of-the-art performance on standard benchmarks
- Automated Audio Captioning (AAC) and Music Captioning (MC) recipes achieve competitive performance, with associated techniques accepted in peer-reviewed academic venues
Breakthrough Assessment
Methodology
- Design a modular framework with plug-and-play components: audio encoders (e.g., Whisper, EnCodec), projection layers, backbone LLMs, and PEFT adapters (e.g., LoRA), all selectable through a unified config system (illustrated after this list)
- Develop task-specific training and inference pipelines for mainstream audio-language tasks (ASR, AAC, Music Captioning, etc.), with curated hyperparameter recipes and data engineering guidelines
- Release high-performance pretrained checkpoints and encourage community contributions to continually expand supported tasks, encoders, and LLM backbones
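As a hypothetical illustration of the unified config idea, the sketch below selects components from name-keyed registries; the registry contents and config keys are assumptions for exposition, not SLAM-LLM's real schema.

```python
# Hypothetical config-driven component selection; registry contents and config
# keys are assumptions for exposition, not SLAM-LLM's actual schema.
from dataclasses import dataclass

@dataclass
class ModelConfig:
    encoder: str = "whisper"   # pretrained audio encoder to load
    projector: str = "linear"  # bridge between encoder and LLM spaces
    llm: str = "llama"         # LLM backbone
    peft: str = "lora"         # parameter-efficient fine-tuning plugin

# In a real framework each factory would return an initialized nn.Module;
# strings stand in here to keep the sketch self-contained and runnable.
ENCODER_REGISTRY = {"whisper": lambda: "WhisperEncoder"}
PROJECTOR_REGISTRY = {"linear": lambda: "LinearProjector"}
LLM_REGISTRY = {"llama": lambda: "LlamaBackbone"}

def build_model(cfg: ModelConfig):
    # Swapping a component is a config change, not a code change.
    return (ENCODER_REGISTRY[cfg.encoder](),
            PROJECTOR_REGISTRY[cfg.projector](),
            LLM_REGISTRY[cfg.llm]())

print(build_model(ModelConfig()))  # ('WhisperEncoder', 'LinearProjector', 'LlamaBackbone')
```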
System Components
- Interchangeable pretrained audio encoder modules (e.g., Whisper, audio foundation models) that extract representations from speech, audio, or music inputs
- Projector (adapter) modules that bridge the encoder output space to the LLM input space, enabling cross-modal alignment (see the projector sketch after this list)
- Pluggable LLM backbones (e.g., LLaMA, Qwen) that process fused audio-language representations for generation tasks
- Parameter-efficient fine-tuning modules such as LoRA that enable task-specific adaptation without full-model retraining
- Pre-defined training and inference configurations for tasks such as ASR, Automated Audio Captioning, and Music Captioning, lowering the barrier to entry for new users
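One common projector design stacks k adjacent encoder frames to downsample the sequence, then maps the result into the LLM embedding space with a linear layer. The dimensions and downsampling factor below are illustrative assumptions, not SLAM-LLM's fixed choices.

```python
# Frame-stacking linear projector: downsample encoder output by a factor of k,
# then project into the LLM embedding space. Dimensions are illustrative.
import torch
import torch.nn as nn

class LinearProjector(nn.Module):
    def __init__(self, enc_dim: int = 1280, llm_dim: int = 4096, k: int = 5):
        super().__init__()
        self.k = k
        self.proj = nn.Linear(enc_dim * k, llm_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        t = t - (t % self.k)  # drop trailing frames so the length divides by k
        x = x[:, :t, :].reshape(b, t // self.k, d * self.k)  # stack k frames
        return self.proj(x)   # (b, t // k, llm_dim)

# Example: 1500 encoder frames become 300 audio tokens in the LLM space.
feats = torch.randn(2, 1500, 1280)
print(LinearProjector()(feats).shape)  # torch.Size([2, 300, 4096])
```

Downsampling matters because audio encoders emit far more frames per second than an LLM needs tokens, and a shorter audio prefix reduces attention cost.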
Results
| Task | Prior MLLM Frameworks | SLAM-LLM | Delta |
|---|---|---|---|
| LLM-based ASR | Fragmented, no unified support | Near/at state-of-the-art | Significant usability improvement |
| Automated Audio Captioning (AAC) | Limited or no dedicated recipes | Competitive/SOTA performance | Substantial gap closed |
| Music Captioning (MC) | Not supported in major frameworks | Competitive performance | New capability enabled |
| Framework modularity | Vision-centric, audio as afterthought | Audio-first modular design | Purpose-built for audio-language tasks |
Key Takeaways
- Researchers building audio-language models should adopt SLAM-LLM as a starting point to avoid reimplementing encoders, projection layers, and training loops, significantly reducing time-to-experiment
- The modular encoder-projector-LLM design pattern used in SLAM-LLM is a practical blueprint for building any domain-specific MLLM, and practitioners can swap components (e.g., different audio encoders or LLM backbones) with minimal code changes
- PEFT methods such as LoRA are integral to the framework, reflecting that parameter-efficient fine-tuning has become the practical standard for adapting large LLMs to new modalities without prohibitive compute costs (a minimal integration sketch follows this list)
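As one concrete example of the LoRA pattern, the sketch below attaches adapters to a causal LLM with the Hugging Face peft library. The model name and hyperparameters are illustrative, and SLAM-LLM's own PEFT integration may differ in detail.

```python
# Attaching LoRA adapters with Hugging Face peft. Model name and
# hyperparameters are illustrative; SLAM-LLM's integration may differ.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_cfg = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt the attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% trainable
```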
Abstract
The recent surge of open-source Multimodal Large Language Model (MLLM) frameworks, such as LLaVA, provides a convenient starting point for artificial intelligence developers and researchers. However, most of these frameworks take vision as the main input modality and provide limited in-depth support for the speech, audio, and music modalities. This hinders the development of audio-language models and forces researchers to spend considerable effort on code development and hyperparameter tuning. We present SLAM-LLM, an open-source deep learning framework for training customized MLLMs, focused on speech, language, audio, and music processing. SLAM-LLM provides modular configuration of different encoders, projectors, LLMs, and parameter-efficient fine-tuning plugins. It also includes detailed training and inference recipes for mainstream tasks, along with high-performance checkpoints for tasks such as LLM-based Automatic Speech Recognition (ASR), Automated Audio Captioning (AAC), and Music Captioning (MC). Some of these recipes have reached or are approaching state-of-the-art performance, and several of the underlying techniques have been accepted at peer-reviewed academic venues. We hope SLAM-LLM will accelerate iteration, development, data engineering, and model training for researchers. We are committed to continually pushing forward audio-based MLLMs through this open-source framework, and we call on the community to contribute to LLM-based speech, audio, and music processing.