ESPnet-SpeechLM: An Open Speech Language Model Toolkit
Problem Statement
Building speech language models requires integrating heterogeneous pipelines for data preprocessing, tokenization, training, and evaluation, which raises barriers to entry and undermines reproducibility. Existing tools lack a unified framework that covers the full SpeechLM development lifecycle across diverse tasks. This fragmentation slows research progress and limits accessibility for practitioners without significant infrastructure resources.
Key Novelty
- Unified sequential modeling abstraction that frames all speech processing tasks (ASR, TTS, spoken dialogue, etc.) under a single framework with configurable task templates
- End-to-end cohesive workflow integrating data preprocessing, pre-training, inference, and evaluation in a single standardized toolkit
- Demonstration of a competitive 1.7B-parameter model pre-trained on both text and speech tasks, with fully transparent and reproducible recipes
Evaluation Highlights
- A 1.7B-parameter SpeechLM trained using the toolkit achieves competitive performance across diverse speech and text benchmarks
- Multiple use cases demonstrated spanning various speech processing tasks, validating the toolkit's flexibility and scalability
Methodology
- Frame all speech processing tasks as universal sequential modeling problems, unifying diverse tasks (ASR, TTS, spoken QA, etc.) under a single token-sequence prediction paradigm
- Implement modular, configurable pipeline stages—data preprocessing, tokenization, model pre-training, and evaluation—allowing users to define task templates and swap components
- Pre-train and benchmark SpeechLMs (up to 1.7B parameters) using the toolkit's recipes on combined text and speech corpora, validating toolkit capabilities across multiple benchmarks
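The "universal sequential modeling" idea in the methodology can be made concrete with a small sketch. The snippet below is a hypothetical illustration, not the toolkit's actual API: a task template names the condition and target token streams, and a serializer flattens any task (ASR, TTS, etc.) into one token sequence for next-token prediction. All names (`TaskTemplate`, `build_sequence`) are invented for this example.

```python
# Hypothetical sketch of task-as-sequence modeling; not the real
# ESPnet-SpeechLM API. ASR and TTS become the same prediction problem,
# differing only in which modalities are conditions vs. targets.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskTemplate:
    """A task defined as ordered, named token streams."""
    name: str
    conditions: List[str]  # modalities the model is conditioned on
    targets: List[str]     # modalities the model must predict


def build_sequence(template: TaskTemplate, streams: Dict[str, List[str]]) -> List[str]:
    """Serialize one example into a flat token sequence:
    <task:NAME> <modality> tokens ... <modality> tokens ... <eos>
    """
    seq = [f"<task:{template.name}>"]
    for modality in template.conditions + template.targets:
        seq.append(f"<{modality}>")
        seq.extend(streams[modality])
    seq.append("<eos>")
    return seq


# ASR: condition on speech tokens, predict text tokens.
asr = TaskTemplate("asr", conditions=["speech"], targets=["text"])
# TTS: the same abstraction with the modalities swapped.
tts = TaskTemplate("tts", conditions=["text"], targets=["speech"])

example = {"speech": ["s1", "s2", "s3"], "text": ["hello", "world"]}
print(build_sequence(asr, example))
```

Under this framing, adding a new task is just a new template; the training loop never changes, which is the crux of the toolkit's unification claim.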
System Components
- Task templates: allow users to define and configure speech/text tasks as sequential modeling problems with standardized input-output formats
- Preprocessing and tokenization: handles audio feature extraction, speech tokenization, and text tokenization in a unified and reproducible manner
- Pre-training: scalable training infrastructure supporting large SpeechLMs (e.g., 1.7B parameters) on mixed speech and text datasets
- Inference: configurable decoding and generation module supporting diverse speech and language tasks at inference time
- Evaluation: standardized benchmark evaluation across multiple speech and NLP tasks to assess model performance reproducibly
- Recipes: transparent, end-to-end runnable scripts and configurations for building competitive SpeechLMs from scratch
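The components above are described as swappable via configuration. A common way to realize that (and a plausible reading of "highly configurable modules") is a registry pattern, sketched below. This is an assumption about the design, not the toolkit's real code; the names `TOKENIZERS`, `register_tokenizer`, and `build_pipeline` are invented.

```python
# Hypothetical registry sketch: pipeline components are registered by
# name and selected from a config dict, so users can swap parts
# (e.g., tokenizers) without touching the pipeline code itself.
from typing import Callable, Dict, List

TOKENIZERS: Dict[str, Callable[[str], List[str]]] = {}


def register_tokenizer(name: str):
    """Decorator that registers a tokenizer under a config-visible name."""
    def deco(fn: Callable[[str], List[str]]) -> Callable[[str], List[str]]:
        TOKENIZERS[name] = fn
        return fn
    return deco


@register_tokenizer("char")
def char_tokenizer(text: str) -> List[str]:
    return list(text)


@register_tokenizer("word")
def word_tokenizer(text: str) -> List[str]:
    return text.split()


def build_pipeline(config: dict) -> Callable[[str], List[str]]:
    # A config entry like {"tokenizer": "word"} selects the component.
    return TOKENIZERS[config["tokenizer"]]


tokenize = build_pipeline({"tokenizer": "word"})
print(tokenize("open speech toolkit"))  # -> ['open', 'speech', 'toolkit']
```

The same pattern extends naturally to models, decoders, and evaluators, which is what makes a single recipe reusable across tasks.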
Results
| Benchmark/Task | Baseline | ESPnet-SpeechLM (1.7B) | Delta |
|---|---|---|---|
| Diverse speech benchmarks | Task-specific models | Competitive unified model | Comparable or better |
| Reproducibility | Ad-hoc pipelines | Fully reproducible recipes | Standardized |
| Development effort | High (custom pipelines) | Low (unified toolkit) | Significant reduction |
Key Takeaways
- ML practitioners can use ESPnet-SpeechLM to rapidly prototype and train speech language models without building custom pipelines, significantly reducing development overhead
- The task template abstraction makes it straightforward to extend the toolkit to new speech or multimodal tasks by defining input-output token sequences, enabling easy customization
- Fully open-source recipes and a pre-trained 1.7B model provide strong starting points for fine-tuning or benchmarking, making cutting-edge SpeechLM research accessible to academic labs with limited resources
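The second takeaway claims that extending the toolkit to a new task reduces to defining its input-output token sequences. A minimal, self-contained sketch of what that could look like is below; the template format and `render` helper are invented for illustration and do not reflect the toolkit's actual configuration schema.

```python
# Hypothetical "new task = new template" sketch, not the real API.
# A spoken-QA task is declared purely as named input/output streams.
NEW_TASK = {
    "name": "spoken_qa",
    "inputs": ["speech_question"],
    "outputs": ["text_answer"],
}


def render(task: dict, data: dict) -> str:
    """Flatten one example into the task's token sequence as a string."""
    parts = [f"<{task['name']}>"]
    for field in task["inputs"] + task["outputs"]:
        parts.append(f"<{field}> " + " ".join(data[field]))
    return " ".join(parts)


print(render(NEW_TASK, {
    "speech_question": ["q1", "q2"],
    "text_answer": ["Paris"],
}))
# -> <spoken_qa> <speech_question> q1 q2 <text_answer> Paris
```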
Abstract
We present ESPnet-SpeechLM, an open toolkit designed to democratize the development of speech language models (SpeechLMs) and voice-driven agentic applications. The toolkit standardizes speech processing tasks by framing them as universal sequential modeling problems, encompassing a cohesive workflow of data preprocessing, pre-training, inference, and task evaluation. With ESPnet-SpeechLM, users can easily define task templates and configure key settings, enabling seamless and streamlined SpeechLM development. The toolkit ensures flexibility, efficiency, and scalability by offering highly configurable modules for every stage of the workflow. To illustrate its capabilities, we provide multiple use cases demonstrating how competitive SpeechLMs can be constructed with ESPnet-SpeechLM, including a 1.7B-parameter model pre-trained on both text and speech tasks and evaluated across diverse benchmarks. The toolkit and its recipes are fully transparent and reproducible at: https://github.com/espnet/espnet/tree/speechlm.