ESPnet-SpeechLM: An Open Speech Language Model Toolkit
Problem Statement
Building speech language models requires integrating heterogeneous pipelines for data preprocessing, tokenization, training, and evaluation, which raises barriers to entry and undermines reproducibility. Existing tools lack a unified framework that covers the full SpeechLM development lifecycle across diverse tasks. This fragmentation slows research progress and limits accessibility for practitioners without significant infrastructure resources.
Key Novelty
- Unified sequential modeling abstraction that frames all speech processing tasks (ASR, TTS, spoken dialogue, etc.) under a single framework with configurable task templates
- End-to-end cohesive workflow integrating data preprocessing, pre-training, inference, and evaluation in a single standardized toolkit
- Demonstration of a competitive 1.7B-parameter model pre-trained on both text and speech tasks, with fully transparent and reproducible recipes
Evaluation Highlights
- A 1.7B-parameter SpeechLM trained using the toolkit achieves competitive performance across diverse speech and text benchmarks
- Multiple use cases demonstrated spanning various speech processing tasks, validating the toolkit's flexibility and scalability
Methodology
- Frame all speech processing tasks as universal sequential modeling problems, unifying diverse tasks (ASR, TTS, spoken QA, etc.) under a single token-sequence prediction paradigm
- Implement modular, configurable pipeline stages—data preprocessing, tokenization, model pre-training, and evaluation—allowing users to define task templates and swap components
- Pre-train and benchmark SpeechLMs (up to 1.7B parameters) using the toolkit's recipes on combined text and speech corpora, validating toolkit capabilities across multiple benchmarks
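The "universal sequential modeling" idea in the methodology can be made concrete with a small sketch. The snippet below is a hypothetical illustration, not the toolkit's actual API: a task template names the condition and target token streams, and a serializer flattens any task (ASR, TTS, etc.) into one token sequence for next-token prediction. All names (`TaskTemplate`, `build_sequence`) are invented for this example.

```python
# Hypothetical sketch of task-as-sequence modeling; not the real
# ESPnet-SpeechLM API. ASR and TTS become the same prediction problem,
# differing only in which modalities are conditions vs. targets.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskTemplate:
    """A task defined as ordered, named token streams."""
    name: str
    conditions: List[str]  # modalities the model is conditioned on
    targets: List[str]     # modalities the model must predict


def build_sequence(template: TaskTemplate, streams: Dict[str, List[str]]) -> List[str]:
    """Serialize one example into a flat token sequence:
    <task:NAME> <modality> tokens ... <modality> tokens ... <eos>
    """
    seq = [f"<task:{template.name}>"]
    for modality in template.conditions + template.targets:
        seq.append(f"<{modality}>")
        seq.extend(streams[modality])
    seq.append("<eos>")
    return seq


# ASR: condition on speech tokens, predict text tokens.
asr = TaskTemplate("asr", conditions=["speech"], targets=["text"])
# TTS: the same abstraction with the modalities swapped.
tts = TaskTemplate("tts", conditions=["text"], targets=["speech"])

example = {"speech": ["s1", "s2", "s3"], "text": ["hello", "world"]}
print(build_sequence(asr, example))
```

Under this framing, adding a new task is just a new template; the training loop never changes, which is the crux of the toolkit's unification claim.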
System Components
- Task templates: allow users to define and configure speech/text tasks as sequential modeling problems with standardized input-output formats
- Preprocessing and tokenization: handles audio feature extraction, speech tokenization, and text tokenization in a unified and reproducible manner
- Pre-training: scalable training infrastructure supporting large SpeechLMs (e.g., 1.7B parameters) on mixed speech and text datasets
- Inference: configurable decoding and generation module supporting diverse speech and language tasks at inference time
- Evaluation: standardized benchmark evaluation across multiple speech and NLP tasks to assess model performance reproducibly
- Recipes: transparent, end-to-end runnable scripts and configurations for building competitive SpeechLMs from scratch
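The components above are described as swappable via configuration. A common way to realize that (and a plausible reading of "highly configurable modules") is a registry pattern, sketched below. This is an assumption about the design, not the toolkit's real code; the names `TOKENIZERS`, `register_tokenizer`, and `build_pipeline` are invented.

```python
# Hypothetical registry sketch: pipeline components are registered by
# name and selected from a config dict, so users can swap parts
# (e.g., tokenizers) without touching the pipeline code itself.
from typing import Callable, Dict, List

TOKENIZERS: Dict[str, Callable[[str], List[str]]] = {}


def register_tokenizer(name: str):
    """Decorator that registers a tokenizer under a config-visible name."""
    def deco(fn: Callable[[str], List[str]]) -> Callable[[str], List[str]]:
        TOKENIZERS[name] = fn
        return fn
    return deco


@register_tokenizer("char")
def char_tokenizer(text: str) -> List[str]:
    return list(text)


@register_tokenizer("word")
def word_tokenizer(text: str) -> List[str]:
    return text.split()


def build_pipeline(config: dict) -> Callable[[str], List[str]]:
    # A config entry like {"tokenizer": "word"} selects the component.
    return TOKENIZERS[config["tokenizer"]]


tokenize = build_pipeline({"tokenizer": "word"})
print(tokenize("open speech toolkit"))  # -> ['open', 'speech', 'toolkit']
```

The same pattern extends naturally to models, decoders, and evaluators, which is what makes a single recipe reusable across tasks.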
Results
| Benchmark/Task | Baseline | ESPnet-SpeechLM (1.7B) | Delta |
|---|---|---|---|
| Diverse speech benchmarks | Task-specific models | Competitive unified model | Comparable or better |
| Reproducibility | Ad-hoc pipelines | Fully reproducible recipes | Standardized |
| Development effort | High (custom pipelines) | Low (unified toolkit) | Significant reduction |
Key Takeaways
- ML practitioners can use ESPnet-SpeechLM to rapidly prototype and train speech language models without building custom pipelines, significantly reducing development overhead
- The task template abstraction makes it straightforward to extend the toolkit to new speech or multimodal tasks by defining input-output token sequences, enabling easy customization
- Fully open-source recipes and a pre-trained 1.7B model provide strong starting points for fine-tuning or benchmarking, making cutting-edge SpeechLM research accessible to academic labs with limited resources
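The second takeaway claims that extending the toolkit to a new task reduces to defining its input-output token sequences. A minimal, self-contained sketch of what that could look like is below; the template format and `render` helper are invented for illustration and do not reflect the toolkit's actual configuration schema.

```python
# Hypothetical "new task = new template" sketch, not the real API.
# A spoken-QA task is declared purely as named input/output streams.
NEW_TASK = {
    "name": "spoken_qa",
    "inputs": ["speech_question"],
    "outputs": ["text_answer"],
}


def render(task: dict, data: dict) -> str:
    """Flatten one example into the task's token sequence as a string."""
    parts = [f"<{task['name']}>"]
    for field in task["inputs"] + task["outputs"]:
        parts.append(f"<{field}> " + " ".join(data[field]))
    return " ".join(parts)


print(render(NEW_TASK, {
    "speech_question": ["q1", "q2"],
    "text_answer": ["Paris"],
}))
# -> <spoken_qa> <speech_question> q1 q2 <text_answer> Paris
```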
Abstract
We present ESPnet-SpeechLM, an open toolkit designed to democratize the development of speech language models (SpeechLMs) and voice-driven agentic applications. The toolkit standardizes speech processing tasks by framing them as universal sequential modeling problems, encompassing a cohesive workflow of data preprocessing, pre-training, inference, and task evaluation. With ESPnet-SpeechLM, users can easily define task templates and configure key settings, enabling seamless and streamlined SpeechLM development. The toolkit ensures flexibility, efficiency, and scalability by offering highly configurable modules for every stage of the workflow. To illustrate its capabilities, we provide multiple use cases demonstrating how competitive SpeechLMs can be constructed with ESPnet-SpeechLM, including a 1.7B-parameter model pre-trained on both text and speech tasks and evaluated across diverse benchmarks. The toolkit and its recipes are fully transparent and reproducible at: https://github.com/espnet/espnet/tree/speechlm.