
SOVA-Bench: Benchmarking the Speech Conversation Ability for LLM-based Voice Assistant

Yixuan Hou, Heyang Liu, Yuhao Wang, Ziyang Cheng, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang
Interspeech | 2025
SOVA-Bench is a comprehensive evaluation framework for LLM-based voice assistants that assesses both semantic accuracy and acoustic quality of generated speech, addressing a critical gap in existing benchmarks.

Problem Statement

Current speech LLM benchmarks focus primarily on speech understanding and semantic accuracy, neglecting the acoustic quality of generated speech responses. As voice assistants advance toward natural, spontaneous speech generation, evaluation frameworks have not kept pace with the need to assess vivid and expressive speech flow. This gap makes it difficult to systematically compare or improve LLM-based voice assistants on real conversational quality metrics.

Key Novelty

  • Introduces SOVA-Bench, one of the first systematic benchmarks for speech LLMs that jointly evaluates general knowledge, speech recognition/understanding, and both semantic and acoustic generative ability
  • Explicitly quantifies acoustic quality of generated speech responses, going beyond semantic accuracy to capture naturalness and spontaneity of voice output
  • Provides a unified comparative framework across multiple available speech LLMs, enabling standardized cross-model evaluation

Evaluation Highlights

  • Comprehensive multi-dimensional assessment covering general knowledge, ASR/SLU tasks, semantic generation quality, and acoustic generation quality across available speech LLMs
  • The benchmark reveals acoustic-quality disparities between models that are invisible to semantic-only evaluations, highlighting the need for multi-faceted assessment

Breakthrough Assessment

5/10: SOVA-Bench is a solid and timely contribution that fills a real gap in speech LLM evaluation by incorporating acoustic quality metrics, but it is primarily a benchmarking/evaluation paper rather than a methodological or modeling advance, limiting its transformative impact.

Methodology

  1. Define evaluation dimensions spanning general knowledge (LLM capability), speech recognition and understanding (perception), and generative ability (both semantic correctness and acoustic quality)
  2. Curate or adapt datasets and metrics for each dimension, incorporating acoustic quality measures such as naturalness, prosody, and spontaneity alongside traditional semantic metrics
  3. Run available speech LLMs through the benchmark pipeline and produce a systematic comparative analysis to identify strengths, weaknesses, and directions for improvement (a minimal pipeline sketch follows this list)
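
As a rough illustration of step 3, here is a minimal sketch of what such a multi-dimensional benchmark loop could look like. Every name here (`run_benchmark`, the scorer callables) is our own placeholder, not SOVA-Bench's released code:

```python
# Hypothetical sketch, not the authors' released pipeline: each "scorer"
# drives one model over one dimension's dataset and returns a scalar.
from typing import Callable

def run_benchmark(model, dimensions: dict[str, Callable]) -> dict[str, float]:
    """Score one speech LLM on every evaluation dimension."""
    return {name: scorer(model) for name, scorer in dimensions.items()}

# Usage (all scorers are placeholders):
# report = run_benchmark(my_speech_llm, {
#     "general_knowledge": knowledge_scorer,
#     "asr_slu": recognition_scorer,
#     "semantic_generation": semantic_scorer,
#     "acoustic_generation": acoustic_scorer,
# })
```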

System Components

General Knowledge Module

Evaluates the underlying LLM's factual and reasoning capabilities when accessed through the speech interface
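
A knowledge dimension like this is often reduced to exact-match accuracy over the model's transcribed answers; a toy scorer for illustration (our metric choice, not necessarily the paper's):

```python
def knowledge_accuracy(predictions: list[str], references: list[str]) -> float:
    """Toy exact-match accuracy over transcribed spoken-QA answers."""
    hits = sum(p.strip().lower() == r.strip().lower()
               for p, r in zip(predictions, references))
    return hits / len(references)

print(knowledge_accuracy(["Paris", "1969"], ["paris", "1968"]))  # 0.5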

Speech Recognition & Understanding Module

Assesses the model's ability to accurately transcribe and interpret spoken user instructions (perception side)
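
Transcription accuracy on the perception side is conventionally reported as word error rate (WER); a minimal example using the `jiwer` package (our choice of tooling, not necessarily the paper's exact toolchain):

```python
import jiwer  # pip install jiwer

reference = "turn off the living room lights"
hypothesis = "turn of the living room light"

# WER = (substitutions + deletions + insertions) / reference word count;
# here two substitutions over six reference words.
print(jiwer.wer(reference, hypothesis))  # ~0.333
```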

Semantic Generative Evaluation

Measures the correctness and relevance of the content in the model's speech responses
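
One common way to score such content (after transcribing the generated speech) is reference-based overlap, e.g. BLEU via `sacrebleu`; whether SOVA-Bench uses this exact metric is our assumption for illustration:

```python
import sacrebleu  # pip install sacrebleu

# Transcripts of generated speech responses vs. reference answers (toy data).
hypotheses = ["paris is the capital city of france"]
references = [["paris is the capital of france"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}")
```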

Acoustic Generative Evaluation

Quantifies the naturalness, prosody, and spontaneity of the generated speech output, capturing qualities beyond semantic accuracy
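
Acoustic quality of this kind is typically approximated with an automatic MOS (mean opinion score) predictor. A hedged sketch follows, where `mos_predictor` is a placeholder for any such model (e.g. a UTMOS-style network); we do not know the paper's exact toolchain:

```python
import torchaudio

def naturalness_score(wav_path: str, mos_predictor) -> float:
    """Predict a 1-5 naturalness MOS for one generated utterance.

    `mos_predictor` is a placeholder for any automatic MOS model;
    it is assumed to take (waveform, sample_rate) and return a score.
    """
    wave, sr = torchaudio.load(wav_path)   # (channels, samples)
    wave = wave.mean(dim=0, keepdim=True)  # downmix to mono
    return float(mos_predictor(wave, sr))
```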

Results

| Evaluation Dimension | Prior Benchmarks | SOVA-Bench | Delta |
|---|---|---|---|
| General Knowledge Coverage | Partial (text-only LLM evals) | Included in speech context | Speech-grounded LLM reasoning |
| Speech Recognition/Understanding | Covered by existing benchmarks | Included as one dimension | Unified framework |
| Semantic Generation Quality | Covered in some works | Included systematically | Standardized comparison |
| Acoustic Generation Quality | Not covered / neglected | Explicitly quantified | New capability unlocked |

Key Takeaways

  • When building or evaluating speech LLMs, acoustic quality (naturalness, prosody, spontaneity) must be measured explicitly — semantic accuracy alone is insufficient for real conversational voice assistants
  • SOVA-Bench provides ML practitioners with a ready-made multi-dimensional evaluation framework to systematically compare speech LLMs across perception, reasoning, and generation axes
  • The benchmark highlights that current speech LLMs likely have uneven performance profiles across its dimensions, suggesting targeted areas (especially acoustic generation) where research investment is most needed

Abstract

Thanks to the steady progress of large language models (LLMs), speech encoding algorithms, and vocoder structures, recent systems can generate speech responses directly from a user instruction. However, benchmarking the quality of the generated speech has been a neglected but critical issue, given the shift from pursuing semantic accuracy alone to vivid and spontaneous speech flow. Previous evaluations focused on speech-understanding ability and lacked a quantification of acoustic quality. In this paper, we propose the Speech cOnversational Voice Assistant Benchmark (SOVA-Bench), providing a comprehensive comparison of general knowledge, speech recognition and understanding, and both semantic and acoustic generative ability across available speech LLMs. To the best of our knowledge, SOVA-Bench is one of the most systematic evaluation frameworks for speech LLMs, offering guidance for the direction of voice interaction systems.
