Speech Recognition Meets Large Language Model: Benchmarking, Models, and Exploration
Problem Statement
The growing body of LLM-based ASR research suffers from inadequately justified design decisions, making it difficult to attribute performance improvements to specific choices. This lack of rigor impedes scientific progress and creates confusion about what truly matters in system design. The paper addresses the need for a principled, comprehensive ablation study to identify essential vs. superfluous components in LLM-ASR pipelines.
Key Novelty
- Comprehensive systematic benchmarking of design choices in LLM-based ASR systems, providing empirical justification for architectural decisions rather than intuition-based design
- Empirical finding that simple, clean setups without delicate task-specific engineering are sufficient and competitive with heavily engineered LLM-ASR systems
- Exploration of capability emergence during cross-modal alignment between speech encoders and LLMs, shedding light on how ASR abilities develop during multimodal training
Evaluation Highlights
- Achieves strong, competitive performance on the LibriSpeech benchmark against both LLM-based and non-LLM-based ASR models
- Demonstrates strong results on the GigaSpeech dataset, validating generalization across speech corpora and recording conditions
Breakthrough Assessment
Methodology
- Identify and enumerate key design dimensions in LLM-based ASR systems (e.g., speech encoder choice, connector/adapter architecture, training strategy, prompt design, LLM backbone) and formulate ablation experiments for each
- Train and evaluate LLM-ASR models across these design axes using controlled experiments on LibriSpeech and GigaSpeech, systematically isolating the impact of each design decision on WER and related metrics (a sketch of this ablation protocol follows the list)
- Analyze capability emergence by examining model behavior during cross-modal alignment training stages to understand how speech understanding develops when integrating speech encoders with LLMs
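Below is a minimal, hypothetical sketch of such a one-axis-at-a-time ablation protocol in Python. The design axes, option names, and the `train_asr_model` / `evaluate_wer` callables are illustrative placeholders, not the paper's actual code or exact configurations.

```python
# Hypothetical ablation sketch: vary one design axis at a time against a fixed
# reference configuration and record WER per run. All names below are assumptions.

DESIGN_AXES = {
    "speech_encoder": ["whisper-large-v2", "hubert-large", "wavlm-large"],
    "connector": ["linear", "mlp", "q-former"],
    "llm_backbone": ["vicuna-7b", "llama-2-7b"],
    "prompt": ["minimal", "task-specific"],
}

# Reference configuration: the first option on every axis.
REFERENCE = {axis: options[0] for axis, options in DESIGN_AXES.items()}

def run_ablation(train_asr_model, evaluate_wer, test_sets=("test-clean", "test-other")):
    """Hold all axes at the reference setting and vary a single axis per run."""
    results = []
    for axis, options in DESIGN_AXES.items():
        for option in options:
            config = {**REFERENCE, axis: option}
            model = train_asr_model(**config)                       # controlled training run
            wers = {name: evaluate_wer(model, name) for name in test_sets}
            results.append({"varied_axis": axis, **config, **wers})
    return results
```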
System Components
- Speech encoder: pre-trained speech model (e.g., Whisper or HuBERT-style) that converts raw audio into dense feature representations for downstream LLM processing
- Connector/projector: interface module bridging speech-encoder outputs and the LLM input space; the paper investigates various designs and finds that simple connectors are sufficient (a minimal sketch follows this list)
- LLM backbone: pre-trained LLM that receives speech features (via the connector) and generates text transcriptions, leveraging its language-modeling capacity for ASR
- Alignment training: procedure that aligns the speech modality with the LLM's text space; the paper analyzes capability-emergence phenomena during this alignment process
- Prompt design: task instructions and prompts fed to the LLM; ablated to show that minimal task-specific prompting is competitive with elaborate prompt engineering
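A minimal PyTorch sketch of how these components fit together is shown below. Module names, feature dimensions, and the frame-stacking factor are illustrative assumptions, not taken from the paper's implementation; in setups of this kind, typically only the projector (and optionally lightweight adapters on the LLM) is trained while the encoder and LLM stay frozen.

```python
# Illustrative component stack: frozen speech encoder -> simple linear projector
# into the LLM embedding space -> decoder-only LLM consuming
# [prompt embeddings; projected speech embeddings]. Dimensions are assumptions.
import torch
import torch.nn as nn

class LinearProjector(nn.Module):
    """Maps speech-encoder features into the LLM's input embedding space."""
    def __init__(self, speech_dim: int, llm_dim: int, downsample: int = 5):
        super().__init__()
        self.downsample = downsample                       # stack frames to shorten the sequence
        self.proj = nn.Linear(speech_dim * downsample, llm_dim)

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        b, t, d = speech_feats.shape
        t = t - t % self.downsample                        # drop trailing frames
        stacked = speech_feats[:, :t].reshape(b, t // self.downsample, d * self.downsample)
        return self.proj(stacked)

def build_llm_inputs(prompt_embeds: torch.Tensor, speech_embeds: torch.Tensor) -> torch.Tensor:
    """Prepend the embedded text prompt to the projected speech embeddings."""
    return torch.cat([prompt_embeds, speech_embeds], dim=1)

# Usage sketch with fake tensors standing in for encoder output and prompt embeddings.
projector = LinearProjector(speech_dim=1280, llm_dim=4096)
speech_embeds = projector(torch.randn(1, 100, 1280))       # placeholder encoder output
prompt_embeds = torch.randn(1, 16, 4096)                   # placeholder prompt embeddings
llm_inputs = build_llm_inputs(prompt_embeds, speech_embeds)  # fed to the LLM backbone
```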
Results
| Aspect | Prior systems | This Paper | Delta |
|---|---|---|---|
| LibriSpeech test-clean (WER) | Competitive LLM-ASR systems | Strong performance (SOTA-level) | Favorable or on-par |
| LibriSpeech test-other (WER) | Competitive LLM-ASR systems | Strong performance (SOTA-level) | Favorable or on-par |
| GigaSpeech (WER) | Competitive non-LLM and LLM baselines | Strong performance | Favorable or on-par |
| System complexity | High (task-specific components) | Low (clean minimal design) | Significantly reduced |
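For reference, the WER figures summarized above are the word-level edit distance between hypothesis and reference transcripts, normalized by the number of reference words. A small illustration using the `jiwer` package (an assumed tooling choice, not necessarily what the paper used):

```python
# Word error rate = (substitutions + insertions + deletions) / reference word count.
import jiwer

reference = "speech recognition meets large language models"
hypothesis = "speech recognition meet large language model"

print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")  # 2 substitutions / 6 reference words
```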
Key Takeaways
- Practitioners building LLM-based ASR systems should prefer clean, simple architectures over heavily engineered task-specific designs — complexity does not reliably yield better performance and obscures what actually matters
- When integrating speech encoders with LLMs, the connector/adapter design and prompting strategy may matter less than generally assumed; investing effort in data quality and alignment training is likely more impactful
- Benchmarking rigor is critical in multimodal LLM research: always ablate individual components and justify design decisions empirically to avoid false attribution of performance gains to irrelevant choices
Abstract
In this paper, we focus on prompting one of the most important tasks in the field of speech processing, i.e., automatic speech recognition (ASR), with speech foundation encoders and large language models (LLMs). Despite the growing body of research in this area, we find that many crucial design decisions in LLM-based ASR systems are often inadequately justified. This lack of clarity impedes the field's progress, making it challenging to pinpoint which design choices truly improve model performance. To address these challenges, we conduct a comprehensive series of experiments exploring various aspects, leading to an optimal LLM-based ASR system. We find that delicate designs are not necessary; a clean setup with little task-specific design is sufficient. The resulting models achieve strong performance on the LibriSpeech and GigaSpeech datasets compared to both LLM-based and non-LLM-based models. Finally, we explore the capability emergence of LLM-based ASR during modal alignment. We hope that our study can facilitate research on extending LLMs with cross-modal capacity and provide insight for the LLM-based ASR community.