SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing
Problem Statement
Existing open-source MLLM frameworks predominantly treat vision as the primary modality, offering limited architectural support, training recipes, and tooling for speech, audio, and music tasks. This forces audio-language researchers to build infrastructure from scratch and invest significant effort in code development and hyperparameter tuning. The absence of a dedicated audio-focused MLLM framework slows down research iteration and reproducibility in the audio-language community.
Key Novelty
- Modular architecture designed for audio modalities, with interchangeable encoders, projectors, LLM backbones, and parameter-efficient fine-tuning (PEFT) plugins (see the sketch after this list)
- Comprehensive training and inference recipes for mainstream audio-language tasks including ASR, Automated Audio Captioning (AAC), and Music Captioning (MC), with some achieving state-of-the-art results
- First dedicated open-source MLLM framework treating speech, audio, and music as first-class modalities rather than extensions of a vision-centric system
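To make the design pattern concrete, here is a minimal PyTorch sketch of the encoder-projector-LLM composition. The class and argument names (e.g., AudioLLM) are illustrative assumptions for exposition, not SLAM-LLM's actual API.

```python
# Minimal sketch of the encoder-projector-LLM pattern; names are illustrative,
# not SLAM-LLM's actual API. The pretrained encoder and LLM typically stay
# frozen; only the projector (and any PEFT adapters) receive gradient updates.
import torch
import torch.nn as nn

class AudioLLM(nn.Module):
    def __init__(self, encoder: nn.Module, projector: nn.Module, llm: nn.Module):
        super().__init__()
        self.encoder = encoder      # e.g., a Whisper encoder (frozen)
        self.projector = projector  # maps encoder features to the LLM space
        self.llm = llm              # e.g., a LLaMA/Qwen backbone

    def forward(self, audio: torch.Tensor, text_embeds: torch.Tensor):
        with torch.no_grad():                 # keep the pretrained encoder frozen
            feats = self.encoder(audio)       # (batch, frames, enc_dim)
        audio_embeds = self.projector(feats)  # (batch, frames', llm_dim)
        # Prepend projected audio tokens to the text prompt embeddings
        # (HF-style call that accepts precomputed input embeddings).
        inputs_embeds = torch.cat([audio_embeds, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```

Freezing the encoder and LLM keeps training lightweight, since only the small projector bridges the two pretrained spaces.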
Evaluation Highlights
- LLM-based Automatic Speech Recognition (ASR) checkpoints reach or approach state-of-the-art performance on standard benchmarks
- Automated Audio Captioning (AAC) and Music Captioning (MC) recipes achieve competitive performance, with associated techniques accepted in peer-reviewed academic venues
Breakthrough Assessment
Methodology
- Design a modular framework with plug-and-play components: audio encoders (e.g., Whisper, EnCodec), projection layers, backbone LLMs, and PEFT adapters (e.g., LoRA), all selectable through a unified config system (illustrated after this list)
- Develop task-specific training and inference pipelines for mainstream audio-language tasks (ASR, AAC, Music Captioning, etc.), with curated hyperparameter recipes and data engineering guidelines
- Release high-performance pretrained checkpoints and encourage community contributions to continually expand supported tasks, encoders, and LLM backbones
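As a hypothetical illustration of the unified config idea, the sketch below selects components from name-keyed registries; the registry contents and config keys are assumptions for exposition, not SLAM-LLM's real schema.

```python
# Hypothetical config-driven component selection; registry contents and config
# keys are assumptions for exposition, not SLAM-LLM's actual schema.
from dataclasses import dataclass

@dataclass
class ModelConfig:
    encoder: str = "whisper"   # pretrained audio encoder to load
    projector: str = "linear"  # bridge between encoder and LLM spaces
    llm: str = "llama"         # LLM backbone
    peft: str = "lora"         # parameter-efficient fine-tuning plugin

# In a real framework each factory would return an initialized nn.Module;
# strings stand in here to keep the sketch self-contained and runnable.
ENCODER_REGISTRY = {"whisper": lambda: "WhisperEncoder"}
PROJECTOR_REGISTRY = {"linear": lambda: "LinearProjector"}
LLM_REGISTRY = {"llama": lambda: "LlamaBackbone"}

def build_model(cfg: ModelConfig):
    # Swapping a component is a config change, not a code change.
    return (ENCODER_REGISTRY[cfg.encoder](),
            PROJECTOR_REGISTRY[cfg.projector](),
            LLM_REGISTRY[cfg.llm]())

print(build_model(ModelConfig()))  # ('WhisperEncoder', 'LinearProjector', 'LlamaBackbone')
```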
System Components
- Interchangeable pretrained audio encoder modules (e.g., Whisper, audio foundation models) that extract representations from speech, audio, or music inputs
- Projector (adapter) modules that bridge the encoder output space to the LLM input space, enabling cross-modal alignment (see the projector sketch after this list)
- Pluggable LLM backbones (e.g., LLaMA, Qwen) that process fused audio-language representations for generation tasks
- Parameter-efficient fine-tuning modules such as LoRA that enable task-specific adaptation without full-model retraining
- Pre-defined training and inference configurations for tasks such as ASR, Automated Audio Captioning, and Music Captioning, lowering the barrier to entry for new users
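One common projector design stacks k adjacent encoder frames to downsample the sequence, then maps the result into the LLM embedding space with a linear layer. The dimensions and downsampling factor below are illustrative assumptions, not SLAM-LLM's fixed choices.

```python
# Frame-stacking linear projector: downsample encoder output by a factor of k,
# then project into the LLM embedding space. Dimensions are illustrative.
import torch
import torch.nn as nn

class LinearProjector(nn.Module):
    def __init__(self, enc_dim: int = 1280, llm_dim: int = 4096, k: int = 5):
        super().__init__()
        self.k = k
        self.proj = nn.Linear(enc_dim * k, llm_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        t = t - (t % self.k)  # drop trailing frames so the length divides by k
        x = x[:, :t, :].reshape(b, t // self.k, d * self.k)  # stack k frames
        return self.proj(x)   # (b, t // k, llm_dim)

# Example: 1500 encoder frames become 300 audio tokens in the LLM space.
feats = torch.randn(2, 1500, 1280)
print(LinearProjector()(feats).shape)  # torch.Size([2, 300, 4096])
```

Downsampling matters because audio encoders emit far more frames per second than an LLM needs tokens, and a shorter audio prefix reduces attention cost.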
Results
| Task | Prior MLLM Frameworks | SLAM-LLM | Delta |
|---|---|---|---|
| LLM-based ASR | Fragmented, no unified support | Near/at state-of-the-art | Significant usability improvement |
| Automated Audio Captioning (AAC) | Limited or no dedicated recipes | Competitive/SOTA performance | Substantial gap closed |
| Music Captioning (MC) | Not supported in major frameworks | Competitive performance | New capability enabled |
| Framework modularity | Vision-centric, audio as afterthought | Audio-first modular design | Purpose-built for audio-language tasks |
Key Takeaways
- Researchers building audio-language models should adopt SLAM-LLM as a starting point to avoid reimplementing encoders, projection layers, and training loops, significantly reducing time-to-experiment
- The modular encoder-projector-LLM design pattern used in SLAM-LLM is a practical blueprint for building any domain-specific MLLM, and practitioners can swap components (e.g., different audio encoders or LLM backbones) with minimal code changes
- PEFT methods such as LoRA are integral to the framework, reflecting that parameter-efficient fine-tuning has become the practical standard for adapting large LLMs to new modalities without prohibitive compute costs (a minimal integration sketch follows this list)
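As one concrete example of the LoRA pattern, the sketch below attaches adapters to a causal LLM with the Hugging Face peft library. The model name and hyperparameters are illustrative, and SLAM-LLM's own PEFT integration may differ in detail.

```python
# Attaching LoRA adapters with Hugging Face peft. Model name and
# hyperparameters are illustrative; SLAM-LLM's integration may differ.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_cfg = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt the attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% trainable
```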
Abstract
The recent surge of open-source Multimodal Large Language Model (MLLM) frameworks, such as LLaVA, provides a convenient starting point for artificial intelligence developers and researchers. However, most of these frameworks take vision as the main input modality and provide limited in-depth support for the speech, audio, and music modalities. This hinders the development of audio-language models and forces researchers to spend considerable effort on code development and hyperparameter tuning. We present SLAM-LLM, an open-source deep learning framework for training customized MLLMs, focused on speech, language, audio, and music processing. SLAM-LLM provides modular configuration of different encoders, projectors, LLMs, and parameter-efficient fine-tuning plugins. It also includes detailed training and inference recipes for mainstream tasks, along with high-performance checkpoints for tasks such as LLM-based Automatic Speech Recognition (ASR), Automated Audio Captioning (AAC), and Music Captioning (MC). Some of these recipes have reached or are approaching state-of-the-art performance, and several of the underlying techniques have been accepted at peer-reviewed academic venues. We hope SLAM-LLM will accelerate iteration, development, data engineering, and model training for researchers. We are committed to continually pushing forward audio-based MLLMs through this open-source framework, and we call on the community to contribute to LLM-based speech, audio, and music processing.