SOVA-Bench: Benchmarking the Speech Conversation Ability for LLM-based Voice Assistant
Problem Statement
Current speech LLM benchmarks focus primarily on speech understanding and semantic accuracy, neglecting the acoustic quality of generated speech responses. As voice assistants advance toward natural, spontaneous speech generation, evaluation frameworks have not kept pace with the need to assess vivid and expressive speech flow. This gap makes it difficult to systematically compare or improve LLM-based voice assistants on real conversational quality metrics.
Key Novelty
- Introduces SOVA-Bench, one of the first systematic benchmarks combining general knowledge, speech recognition/understanding, and both semantic and acoustic generative ability evaluation for speech LLMs
- Explicitly quantifies acoustic quality of generated speech responses, going beyond semantic accuracy to capture naturalness and spontaneity of voice output
- Provides a unified comparative framework across multiple available speech LLMs, enabling standardized cross-model evaluation
Evaluation Highlights
- Comprehensive multi-dimensional assessment covering general knowledge, ASR/SLU tasks, semantic generation quality, and acoustic generation quality across available speech LLMs
- Benchmark reveals disparities between models in acoustic quality that are invisible to semantic-only evaluations, highlighting the need for multi-faceted assessment
Methodology
- Define evaluation dimensions spanning general knowledge (LLM capability), speech recognition and understanding (perception), and generative ability (both semantic correctness and acoustic quality)
- Curate or adapt datasets and metrics for each dimension, incorporating acoustic quality measures such as naturalness, prosody, and spontaneity alongside traditional semantic metrics
- Run available speech LLMs through the benchmark pipeline and produce a systematic comparative analysis to identify strengths, weaknesses, and directions for improvement
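The three steps above can be sketched as a minimal evaluation pipeline. Everything below — class names, scorer functions, task prompts — is an invented illustration of the multi-dimensional setup, not SOVA-Bench's actual code:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class DimensionResult:
    dimension: str  # e.g. "acoustic_generation" (hypothetical label)
    metric: str     # name of the scorer applied to this dimension
    score: float    # mean score over the dimension's prompts

def run_benchmark(
    model_fn: Callable[[str], str],
    tasks: Dict[str, List[str]],
    scorers: Dict[str, Callable[[str], float]],
) -> List[DimensionResult]:
    """Run one speech LLM over every evaluation dimension and average."""
    results = []
    for dimension, prompts in tasks.items():
        scorer = scorers[dimension]
        scores = [scorer(model_fn(p)) for p in prompts]
        results.append(DimensionResult(
            dimension=dimension,
            metric=scorer.__name__,
            score=sum(scores) / len(scores),
        ))
    return results

# Toy usage with a stub model and stub scorers.
def answer_accuracy(response: str) -> float:
    return 1.0 if "Paris" in response else 0.0

def naturalness_mos(response: str) -> float:
    return 4.0  # stand-in for a MOS-style acoustic quality predictor

results = run_benchmark(
    model_fn=lambda prompt: "Paris",
    tasks={
        "general_knowledge": ["What is the capital of France?"],
        "acoustic_generation": ["Say hello warmly."],
    },
    scorers={
        "general_knowledge": answer_accuracy,
        "acoustic_generation": naturalness_mos,
    },
)
```

In a real pipeline the acoustic scorer would operate on waveforms (e.g. a learned MOS predictor) rather than text, but the aggregation structure is the same: one scorer per dimension, averaged per model.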
System Components
- General Knowledge: Evaluates the underlying LLM's factual and reasoning capabilities when accessed through the speech interface
- Speech Recognition and Understanding: Assesses the model's ability to accurately transcribe and interpret spoken user instructions (perception side)
- Semantic Generation: Measures the correctness and relevance of the content in the model's speech responses
- Acoustic Generation: Quantifies the naturalness, prosody, and spontaneity of the generated speech output, capturing qualities beyond semantic accuracy
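Scoring the four dimensions separately makes uneven model profiles visible. A toy sketch (all model names, scores, and the gap heuristic are invented for illustration) of how a semantic-only view would hide an acoustic weakness:

```python
def profile_gap(scores: dict) -> float:
    """Spread between a model's strongest and weakest dimension.

    0.0 means a perfectly even profile; a large value flags a
    dimension where the model lags far behind its own peak.
    """
    vals = list(scores.values())
    return max(vals) - min(vals)

# Fabricated per-dimension scores for two hypothetical models.
model_a = {"knowledge": 0.82, "recognition": 0.90,
           "semantic_gen": 0.78, "acoustic_gen": 0.41}
model_b = {"knowledge": 0.70, "recognition": 0.72,
           "semantic_gen": 0.69, "acoustic_gen": 0.68}

# Model A looks stronger on semantic metrics alone, but its large
# profile gap is driven entirely by acoustic generation -- exactly
# the kind of disparity a semantic-only benchmark cannot surface.
```

This is the failure mode the benchmark is designed to expose: averaging or dropping the acoustic axis would rank these two models very differently than inspecting the full profile does.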
Results
| Evaluation Dimension | Prior Benchmarks | SOVA-Bench | Delta |
|---|---|---|---|
| General Knowledge Coverage | Partial (text-only LLM evals) | Included in speech context | Speech-grounded LLM reasoning |
| Speech Recognition/Understanding | Covered by existing benchmarks | Included as one dimension | Unified framework |
| Semantic Generation Quality | Covered in some works | Included systematically | Standardized comparison |
| Acoustic Generation Quality | Not covered / neglected | Explicitly quantified | New evaluation axis |
Key Takeaways
- When building or evaluating speech LLMs, acoustic quality (naturalness, prosody, spontaneity) must be measured explicitly — semantic accuracy alone is insufficient for real conversational voice assistants
- SOVA-Bench provides ML practitioners with a ready-made multi-dimensional evaluation framework to systematically compare speech LLMs across perception, reasoning, and generation axes
- The benchmark highlights that current speech LLMs likely have uneven performance profiles across its dimensions, suggesting targeted areas (especially acoustic generation) where research investment is most needed
Abstract
Thanks to the steady progress of large language models (LLMs), speech encoding algorithms, and vocoder structures, recent advancements have enabled generating speech responses directly from a user instruction. However, benchmarking the quality of the generated speech has been a neglected but critical issue, given the shift from pursuing semantic accuracy to pursuing vivid and spontaneous speech flow. Previous evaluations focused on speech-understanding ability and lacked a quantification of acoustic quality. In this paper, we propose the Speech cOnversational Voice Assistant Benchmark (SOVA-Bench), providing a comprehensive comparison of general knowledge, speech recognition and understanding, and both semantic and acoustic generative ability across available speech LLMs. To the best of our knowledge, SOVA-Bench is one of the most systematic evaluation frameworks for speech LLMs, and we hope it inspires the direction of voice interaction systems.