SOVA-Bench: Benchmarking the Speech Conversation Ability for LLM-based Voice Assistant
Problem Statement
Current speech LLM benchmarks focus primarily on speech understanding and semantic accuracy, neglecting the acoustic quality of generated speech responses. As voice assistants advance toward natural, spontaneous speech generation, evaluation frameworks have not kept pace with the need to assess vivid and expressive speech flow. This gap makes it difficult to systematically compare or improve LLM-based voice assistants on real conversational quality metrics.
Key Novelty
- Introduces SOVA-Bench, one of the first systematic benchmarks combining general knowledge, speech recognition/understanding, and both semantic and acoustic generative ability evaluation for speech LLMs
- Explicitly quantifies acoustic quality of generated speech responses, going beyond semantic accuracy to capture naturalness and spontaneity of voice output
- Provides a unified comparative framework across multiple available speech LLMs, enabling standardized cross-model evaluation
Evaluation Highlights
- Comprehensive multi-dimensional assessment covering general knowledge, ASR/SLU tasks, semantic generation quality, and acoustic generation quality across available speech LLMs
- Benchmark reveals disparities between models in acoustic quality that are invisible to semantic-only evaluations, highlighting the need for multi-faceted assessment
Methodology
- Define evaluation dimensions spanning general knowledge (LLM capability), speech recognition and understanding (perception), and generative ability (both semantic correctness and acoustic quality)
- Curate or adapt datasets and metrics for each dimension, incorporating acoustic quality measures such as naturalness, prosody, and spontaneity alongside traditional semantic metrics
- Run available speech LLMs through the benchmark pipeline and produce a systematic comparative analysis to identify strengths, weaknesses, and directions for improvement
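The three steps above can be sketched as a minimal evaluation pipeline. Everything below — class names, scorer functions, task prompts — is an invented illustration of the multi-dimensional setup, not SOVA-Bench's actual code:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class DimensionResult:
    dimension: str  # e.g. "acoustic_generation" (hypothetical label)
    metric: str     # name of the scorer applied to this dimension
    score: float    # mean score over the dimension's prompts

def run_benchmark(
    model_fn: Callable[[str], str],
    tasks: Dict[str, List[str]],
    scorers: Dict[str, Callable[[str], float]],
) -> List[DimensionResult]:
    """Run one speech LLM over every evaluation dimension and average."""
    results = []
    for dimension, prompts in tasks.items():
        scorer = scorers[dimension]
        scores = [scorer(model_fn(p)) for p in prompts]
        results.append(DimensionResult(
            dimension=dimension,
            metric=scorer.__name__,
            score=sum(scores) / len(scores),
        ))
    return results

# Toy usage with a stub model and stub scorers.
def answer_accuracy(response: str) -> float:
    return 1.0 if "Paris" in response else 0.0

def naturalness_mos(response: str) -> float:
    return 4.0  # stand-in for a MOS-style acoustic quality predictor

results = run_benchmark(
    model_fn=lambda prompt: "Paris",
    tasks={
        "general_knowledge": ["What is the capital of France?"],
        "acoustic_generation": ["Say hello warmly."],
    },
    scorers={
        "general_knowledge": answer_accuracy,
        "acoustic_generation": naturalness_mos,
    },
)
```

In a real pipeline the acoustic scorer would operate on waveforms (e.g. a learned MOS predictor) rather than text, but the aggregation structure is the same: one scorer per dimension, averaged per model.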
System Components
- General Knowledge: Evaluates the underlying LLM's factual and reasoning capabilities when accessed through the speech interface
- Speech Recognition and Understanding: Assesses the model's ability to accurately transcribe and interpret spoken user instructions (perception side)
- Semantic Generation: Measures the correctness and relevance of the content in the model's speech responses
- Acoustic Generation: Quantifies the naturalness, prosody, and spontaneity of the generated speech output, capturing qualities beyond semantic accuracy
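Scoring the four dimensions separately makes uneven model profiles visible. A toy sketch (all model names, scores, and the gap heuristic are invented for illustration) of how a semantic-only view would hide an acoustic weakness:

```python
def profile_gap(scores: dict) -> float:
    """Spread between a model's strongest and weakest dimension.

    0.0 means a perfectly even profile; a large value flags a
    dimension where the model lags far behind its own peak.
    """
    vals = list(scores.values())
    return max(vals) - min(vals)

# Fabricated per-dimension scores for two hypothetical models.
model_a = {"knowledge": 0.82, "recognition": 0.90,
           "semantic_gen": 0.78, "acoustic_gen": 0.41}
model_b = {"knowledge": 0.70, "recognition": 0.72,
           "semantic_gen": 0.69, "acoustic_gen": 0.68}

# Model A looks stronger on semantic metrics alone, but its large
# profile gap is driven entirely by acoustic generation -- exactly
# the kind of disparity a semantic-only benchmark cannot surface.
```

This is the failure mode the benchmark is designed to expose: averaging or dropping the acoustic axis would rank these two models very differently than inspecting the full profile does.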
Results
| Evaluation Dimension | Prior Benchmarks | SOVA-Bench | Delta |
|---|---|---|---|
| General Knowledge Coverage | Partial (text-only LLM evals) | Included in speech context | Speech-grounded LLM reasoning |
| Speech Recognition/Understanding | Covered by existing benchmarks | Included as one dimension | Unified framework |
| Semantic Generation Quality | Covered in some works | Included systematically | Standardized comparison |
| Acoustic Generation Quality | Not covered / neglected | Explicitly quantified | New evaluation axis |
Key Takeaways
- When building or evaluating speech LLMs, acoustic quality (naturalness, prosody, spontaneity) must be measured explicitly — semantic accuracy alone is insufficient for real conversational voice assistants
- SOVA-Bench provides ML practitioners with a ready-made multi-dimensional evaluation framework to systematically compare speech LLMs across perception, reasoning, and generation axes
- The benchmark highlights that current speech LLMs likely have uneven performance profiles across its dimensions, suggesting targeted areas (especially acoustic generation) where research investment is most needed
Abstract
Thanks to the steady progress of large language models (LLMs), speech encoding algorithms, and vocoder structures, recent advancements have enabled generating speech responses directly from a user instruction. However, benchmarking the quality of the generated speech has been a neglected but critical issue, given the shift from pursuing semantic accuracy to pursuing vivid and spontaneous speech flow. Previous evaluations focused on speech-understanding ability and lacked a quantification of acoustic quality. In this paper, we propose the Speech cOnversational Voice Assistant Benchmark (SOVA-Bench), providing a comprehensive comparison of general knowledge, speech recognition and understanding, and both semantic and acoustic generative ability across available speech LLMs. To the best of our knowledge, SOVA-Bench is one of the most systematic evaluation frameworks for speech LLMs, and we hope it inspires the direction of voice interaction systems.