Speech Recognition Meets Large Language Model: Benchmarking, Models, and Exploration
Problem Statement
The growing body of LLM-based ASR research suffers from inadequately justified design decisions, making it difficult to attribute performance improvements to specific choices. This lack of rigor impedes scientific progress and creates confusion about what truly matters in system design. The paper addresses the need for a principled, comprehensive ablation study to identify essential vs. superfluous components in LLM-ASR pipelines.
Key Novelty
- Comprehensive systematic benchmarking of design choices in LLM-based ASR systems, providing empirical justification for architectural decisions rather than intuition-based design
- Empirical finding that simple, clean setups without delicate task-specific engineering are sufficient and competitive with heavily engineered LLM-ASR systems
- Exploration of capability emergence during cross-modal alignment between speech encoders and LLMs, shedding light on how ASR abilities develop during multimodal training
Evaluation Highlights
- Achieves strong, competitive performance on the LibriSpeech benchmark against both LLM-based and non-LLM-based ASR models
- Demonstrates strong results on the GigaSpeech dataset, validating generalization across speech corpora and recording conditions
Breakthrough Assessment
Methodology
- Identify and enumerate key design dimensions in LLM-based ASR systems (e.g., speech encoder choice, connector/adapter architecture, training strategy, prompt design, LLM backbone) and formulate ablation experiments for each
- Train and evaluate LLM-ASR models across these design axes using controlled experiments on LibriSpeech and GigaSpeech, systematically isolating the impact of each design decision on WER and related metrics (a sketch of this ablation protocol follows the list)
- Analyze capability emergence by examining model behavior during cross-modal alignment training stages to understand how speech understanding develops when integrating speech encoders with LLMs
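Below is a minimal, hypothetical sketch of such a one-axis-at-a-time ablation protocol in Python. The design axes, option names, and the `train_asr_model` / `evaluate_wer` callables are illustrative placeholders, not the paper's actual code or exact configurations.

```python
# Hypothetical ablation sketch: vary one design axis at a time against a fixed
# reference configuration and record WER per run. All names below are assumptions.

DESIGN_AXES = {
    "speech_encoder": ["whisper-large-v2", "hubert-large", "wavlm-large"],
    "connector": ["linear", "mlp", "q-former"],
    "llm_backbone": ["vicuna-7b", "llama-2-7b"],
    "prompt": ["minimal", "task-specific"],
}

# Reference configuration: the first option on every axis.
REFERENCE = {axis: options[0] for axis, options in DESIGN_AXES.items()}

def run_ablation(train_asr_model, evaluate_wer, test_sets=("test-clean", "test-other")):
    """Hold all axes at the reference setting and vary a single axis per run."""
    results = []
    for axis, options in DESIGN_AXES.items():
        for option in options:
            config = {**REFERENCE, axis: option}
            model = train_asr_model(**config)                       # controlled training run
            wers = {name: evaluate_wer(model, name) for name in test_sets}
            results.append({"varied_axis": axis, **config, **wers})
    return results
```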
System Components
- Speech encoder: pre-trained speech model (e.g., Whisper or HuBERT-style) that converts raw audio into dense feature representations for downstream LLM processing
- Connector/projector: interface module bridging speech-encoder outputs and the LLM input space; the paper investigates various designs and finds that simple connectors are sufficient (a minimal sketch follows this list)
- LLM backbone: pre-trained LLM that receives speech features (via the connector) and generates text transcriptions, leveraging its language-modeling capacity for ASR
- Alignment training: procedure that aligns the speech modality with the LLM's text space; the paper analyzes capability-emergence phenomena during this alignment process
- Prompt design: task instructions and prompts fed to the LLM; ablated to show that minimal task-specific prompting is competitive with elaborate prompt engineering
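A minimal PyTorch sketch of how these components fit together is shown below. Module names, feature dimensions, and the frame-stacking factor are illustrative assumptions, not taken from the paper's implementation; in setups of this kind, typically only the projector (and optionally lightweight adapters on the LLM) is trained while the encoder and LLM stay frozen.

```python
# Illustrative component stack: frozen speech encoder -> simple linear projector
# into the LLM embedding space -> decoder-only LLM consuming
# [prompt embeddings; projected speech embeddings]. Dimensions are assumptions.
import torch
import torch.nn as nn

class LinearProjector(nn.Module):
    """Maps speech-encoder features into the LLM's input embedding space."""
    def __init__(self, speech_dim: int, llm_dim: int, downsample: int = 5):
        super().__init__()
        self.downsample = downsample                       # stack frames to shorten the sequence
        self.proj = nn.Linear(speech_dim * downsample, llm_dim)

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        b, t, d = speech_feats.shape
        t = t - t % self.downsample                        # drop trailing frames
        stacked = speech_feats[:, :t].reshape(b, t // self.downsample, d * self.downsample)
        return self.proj(stacked)

def build_llm_inputs(prompt_embeds: torch.Tensor, speech_embeds: torch.Tensor) -> torch.Tensor:
    """Prepend the embedded text prompt to the projected speech embeddings."""
    return torch.cat([prompt_embeds, speech_embeds], dim=1)

# Usage sketch with fake tensors standing in for encoder output and prompt embeddings.
projector = LinearProjector(speech_dim=1280, llm_dim=4096)
speech_embeds = projector(torch.randn(1, 100, 1280))       # placeholder encoder output
prompt_embeds = torch.randn(1, 16, 4096)                   # placeholder prompt embeddings
llm_inputs = build_llm_inputs(prompt_embeds, speech_embeds)  # fed to the LLM backbone
```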
Results
| Aspect | Prior systems | This Paper | Delta |
|---|---|---|---|
| LibriSpeech test-clean (WER) | Competitive LLM-ASR systems | Strong performance (SOTA-level) | Favorable or on-par |
| LibriSpeech test-other (WER) | Competitive LLM-ASR systems | Strong performance (SOTA-level) | Favorable or on-par |
| GigaSpeech (WER) | Competitive non-LLM and LLM baselines | Strong performance | Favorable or on-par |
| System complexity | High (task-specific components) | Low (clean minimal design) | Significantly reduced |
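For reference, the WER figures summarized above are the word-level edit distance between hypothesis and reference transcripts, normalized by the number of reference words. A small illustration using the `jiwer` package (an assumed tooling choice, not necessarily what the paper used):

```python
# Word error rate = (substitutions + insertions + deletions) / reference word count.
import jiwer

reference = "speech recognition meets large language models"
hypothesis = "speech recognition meet large language model"

print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")  # 2 substitutions / 6 reference words
```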
Key Takeaways
- Practitioners building LLM-based ASR systems should prefer clean, simple architectures over heavily engineered task-specific designs — complexity does not reliably yield better performance and obscures what actually matters
- When integrating speech encoders with LLMs, the connector/adapter design and prompting strategy may matter less than generally assumed; investing effort in data quality and alignment training is likely more impactful
- Benchmarking rigor is critical in multimodal LLM research: always ablate individual components and justify design decisions empirically to avoid false attribution of performance gains to irrelevant choices
Abstract
In this paper, we focus on prompting one of the most important tasks in the field of speech processing, i.e., automatic speech recognition (ASR), with speech foundation encoders and large language models (LLMs). Despite the growing body of research in this area, we find that many crucial design decisions in LLM-based ASR systems are often inadequately justified. This lack of clarity impedes the field's progress, making it challenging to pinpoint which design choices truly improve model performance. To address these challenges, we conduct a comprehensive series of experiments exploring various aspects, leading to an optimal LLM-based ASR system. We find that delicate designs are not necessary; a clean setup with little task-specific design is sufficient. The resulting models achieve strong performance on the LibriSpeech and GigaSpeech datasets compared to both LLM-based and non-LLM-based models. Finally, we explore the capability emergence of LLM-based ASR during modal alignment. We hope that our study can facilitate research on extending LLMs with cross-modal capacity and provide insight for the LLM-based ASR community.