Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
Problem Statement
Existing LLM-based TTS foundation models require multi-stage pipelines or complex multi-codebook architectures (e.g., residual vector quantization with multiple codebooks) that reduce inference efficiency and complicate system integration. These approaches also struggle to provide fine-grained, attribute-level voice control beyond simple reference-audio cloning. A unified, efficient architecture that supports both coarse and precise voice customization without sacrificing quality has been lacking.
Key Novelty
- BiCodec: A single-stream speech codec that disentangles speech into low-bitrate semantic tokens (linguistic content) and fixed-length global tokens (speaker attributes), eliminating the need for multi-codebook prediction
- Chain-of-Thought (CoT) generation framework integrated with Qwen2.5 LLM enabling both coarse-grained control (gender, speaking style) and fine-grained control (precise pitch values, speaking rate) for voice customization beyond reference-based synthesis
- VoxBox: A meticulously curated 100,000-hour speech dataset with comprehensive attribute annotations designed to facilitate research in controllable TTS
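The single-stream design above can be illustrated with a minimal sketch. All names and token values here are hypothetical stand-ins, not the released BiCodec API; the point is that both token types serialize into one flat sequence the LLM predicts one token per step, instead of several parallel codebook streams per frame as in RVQ-based codecs:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BiCodecTokens:
    """Hypothetical container mirroring BiCodec's two token types."""
    global_tokens: List[int]    # fixed-length: speaker identity and attributes
    semantic_tokens: List[int]  # variable-length: linguistic content, low bitrate

def to_single_stream(tokens: BiCodecTokens, sep: int = -1) -> List[int]:
    """Serialize both token types into one flat sequence that an LLM can
    predict autoregressively, one token per step (no parallel codebooks)."""
    return tokens.global_tokens + [sep] + tokens.semantic_tokens

# A multi-codebook RVQ codec would instead emit N codes per frame,
# forcing the LLM to predict N streams in parallel or in extra stages.
toks = BiCodecTokens(global_tokens=[7, 3, 9], semantic_tokens=[12, 5, 5, 8])
stream = to_single_stream(toks)
```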
Evaluation Highlights
- Spark-TTS achieves state-of-the-art zero-shot voice cloning performance, surpassing existing foundation TTS models on standard benchmarks
- The system generates highly customizable synthetic voices with fine-grained attribute control, going beyond what reference-audio-based synthesis approaches can express
Methodology
- Train BiCodec to decompose speech into two complementary token streams: variable-length semantic tokens capturing linguistic content at low bitrate, and fixed-length global tokens encoding speaker identity and attributes
- Fine-tune Qwen2.5 LLM on the VoxBox 100K-hour annotated dataset to generate speech tokens autoregressively, using a chain-of-thought prompting strategy that first reasons about target voice attributes before generating tokens
- Decode the generated single-stream token sequence through BiCodec's vocoder to synthesize high-quality speech, with the decoupled representation enabling both zero-shot cloning (from reference audio global tokens) and attribute-controlled synthesis (from text-specified parameters)
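The three steps above can be sketched as a toy pipeline. Every function and token value below is an illustrative stand-in (not the released code): the key structural point is that zero-shot cloning and attribute-controlled synthesis differ only in where the global tokens come from, while the LLM and decoder stages are shared:

```python
from typing import Dict, List, Optional

def encode_reference(audio: List[float]) -> List[int]:
    """Stand-in for BiCodec's global encoder: map reference audio to a
    fixed-length global (speaker) token sequence (toy length of 3)."""
    return [int(x * 100) % 100 for x in audio[:3]]

def llm_generate(text: str, global_tokens: List[int]) -> List[int]:
    """Stand-in for the fine-tuned Qwen2.5 step: autoregressively produce
    semantic tokens conditioned on the text and the global tokens."""
    return [ord(c) % 50 for c in text]  # toy: one token per character

def synthesize(text: str,
               reference: Optional[List[float]] = None,
               attributes: Optional[Dict[str, str]] = None) -> List[int]:
    """Zero-shot cloning takes global tokens from reference audio;
    attribute-controlled synthesis derives them from text-specified
    attributes instead (here a toy deterministic mapping)."""
    if reference is not None:
        g = encode_reference(reference)
    else:
        g = [sum(ord(c) for c in k + v) % 100
             for k, v in sorted((attributes or {}).items())]
    # Single-stream output that a vocoder stage would decode to audio.
    return g + llm_generate(text, g)
```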
System Components
- BiCodec: Single-stream speech codec that disentangles speech into low-bitrate semantic tokens (linguistic/content information) and fixed-length global tokens (speaker identity, style attributes), enabling efficient single-pass LLM prediction
- Qwen2.5 LLM backbone: Pre-trained large language model adapted for autoregressive speech token generation, leveraging its language understanding for chain-of-thought reasoning over voice attributes
- Chain-of-thought generation: Generation strategy where the model first explicitly reasons about desired voice attributes (gender, pitch, speaking rate, style) before producing speech tokens, enabling fine-grained controllable synthesis
- VoxBox dataset: 100,000-hour curated speech corpus with comprehensive per-sample attribute annotations (gender, speaking style, pitch, rate, etc.) designed specifically for training controllable TTS models
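The chain-of-thought control described above can be illustrated with a hypothetical prompt builder. The paper does not specify its exact prompt format, so the tags and field names below are assumptions; what the sketch shows is the CoT ordering, with desired attributes stated before any speech tokens are generated, and the coarse (categorical) versus fine (numeric) split:

```python
from typing import Dict, Optional

def build_cot_prompt(text: str,
                     coarse: Optional[Dict[str, str]] = None,
                     fine: Optional[Dict[str, float]] = None) -> str:
    """Compose a prompt that states target voice attributes *before* token
    generation, so the model reasons about the voice first (CoT-style).
    Coarse attributes are categorical (e.g. gender, style); fine attributes
    are numeric (e.g. pitch in Hz, speaking rate)."""
    parts = [f"{k}={v}" for k, v in (coarse or {}).items()]
    parts += [f"{k}={v:g}" for k, v in (fine or {}).items()]
    attr_block = "; ".join(parts) if parts else "unspecified"
    return f"[attributes: {attr_block}]\n[text: {text}]\n[speech tokens:]"
```

Supplying only coarse labels yields coarse-grained control, while adding numeric fields such as an exact pitch value gives the fine-grained control the summary describes, with no architectural change.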
Results
| Metric/Benchmark | Baseline (Best Prior) | Spark-TTS | Delta |
|---|---|---|---|
| Zero-shot Voice Cloning Quality | State-of-the-art prior models | New SOTA | Surpasses existing foundation models |
| Voice Attribute Controllability | Reference-audio limited | Fine-grained text-specified control | Enables precise pitch/rate beyond reference cloning |
| Architecture Complexity | Multi-stage / multi-codebook | Single-stream, single-stage LLM | Reduced complexity, improved integration |
| Training Data | Typically <50K hours annotated | 100K hours (VoxBox) | >2× annotated hours, with per-sample attribute labels |
Key Takeaways
- Decoupling speech tokens into semantic (content) and global (speaker) streams is a powerful design pattern that simplifies LLM integration for TTS — practitioners building voice AI systems should consider this two-token-type abstraction over residual codebook approaches
- Chain-of-thought prompting is not just for reasoning tasks — applying CoT to intermediate attribute specification before token generation is an effective way to inject fine-grained control into generative audio models without architectural overhauls
- The release of BiCodec, pre-trained models, and the VoxBox 100K-hour annotated dataset makes Spark-TTS a strong foundation for downstream controllable TTS research and production systems, particularly for zero-shot voice customization applications
Abstract
Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis. However, existing foundation models rely on multi-stage processing or complex architectures for predicting multiple codebooks, limiting efficiency and integration flexibility. To overcome these challenges, we introduce Spark-TTS, a novel system powered by BiCodec, a single-stream speech codec that decomposes speech into two complementary token types: low-bitrate semantic tokens for linguistic content and fixed-length global tokens for speaker attributes. This disentangled representation, combined with the Qwen2.5 LLM and a chain-of-thought (CoT) generation approach, enables both coarse-grained control (e.g., gender, speaking style) and fine-grained adjustments (e.g., precise pitch values, speaking rate). To facilitate research in controllable TTS, we introduce VoxBox, a meticulously curated 100,000-hour dataset with comprehensive attribute annotations. Extensive experiments demonstrate that Spark-TTS not only achieves state-of-the-art zero-shot voice cloning but also generates highly customizable voices that surpass the limitations of reference-based synthesis. Source code, pre-trained models, and audio samples are available at https://github.com/SparkAudio/Spark-TTS.