Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

Xinsheng Wang, Mingqi Jiang, Ziyang Ma, Ziyu Zhang, Songxiang Liu, Linqin Li, Zheng Liang, Qixi Zheng, Rui Wang, Xiaoqin Feng, Weizhen Bian, Zhen Ye, Sitong Cheng, Ruibin Yuan, Zhixian Zhao, Xinfa Zhu, Jiahao Pan, Liumeng Xue, Pengcheng Zhu, Yunlin Chen, Zhifei Li, Xie Chen, Lei Xie, Yi-Min Guo, Wei-feng Xue
arXiv.org | 2025
Spark-TTS introduces BiCodec, a single-stream speech codec that decomposes speech into semantic and global speaker tokens, enabling an LLM-based TTS system that achieves state-of-the-art zero-shot voice cloning with fine-grained controllability through chain-of-thought generation.

Problem Statement

Existing LLM-based TTS foundation models require multi-stage pipelines or complex multi-codebook architectures (e.g., residual vector quantization with multiple codebooks) that reduce inference efficiency and complicate system integration. These approaches also struggle to provide fine-grained, attribute-level voice control beyond simple reference-audio cloning. A unified, efficient architecture that supports both coarse and precise voice customization without sacrificing quality has been lacking.
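The efficiency cost of multi-codebook prediction is easy to see with a back-of-envelope count of tokens an autoregressive LM must emit per second of speech. The frame rate and codebook count below are illustrative assumptions, not figures from the paper:

```python
# Back-of-envelope: autoregressive steps per second of generated speech.
# Frame rate and codebook count are illustrative, not from the paper.
frame_rate_hz = 50       # acoustic token frames per second
rvq_codebooks = 8        # typical residual-VQ codebook count

single_stream_tokens = frame_rate_hz * 1               # BiCodec-style single stream
multi_codebook_tokens = frame_rate_hz * rvq_codebooks  # RVQ-style multi-stream

print(single_stream_tokens, multi_codebook_tokens)
```

Under these assumptions, an RVQ-based model must predict 8x more tokens per second (or adopt extra machinery such as delay patterns or a second-stage model), which is exactly the integration burden a single-stream codec avoids.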

Key Novelty

  • BiCodec: A single-stream speech codec that disentangles speech into low-bitrate semantic tokens (linguistic content) and fixed-length global tokens (speaker attributes), eliminating the need for multi-codebook prediction
  • Chain-of-Thought (CoT) generation framework integrated with Qwen2.5 LLM enabling both coarse-grained control (gender, speaking style) and fine-grained control (precise pitch values, speaking rate) for voice customization beyond reference-based synthesis
  • VoxBox: A meticulously curated 100,000-hour speech dataset with comprehensive attribute annotations designed to facilitate research in controllable TTS
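The two-token-type design can be sketched as a tiny container: a variable-length semantic stream plus a fixed-length global block, flattened into one sequence for the LLM. The class and field names below are assumptions for illustration, not the released API:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BiCodecTokens:
    """Hypothetical container for BiCodec's two token types."""
    semantic: List[int]   # variable-length, low-bitrate content tokens
    global_: List[int]    # fixed-length speaker/attribute tokens

    def to_single_stream(self) -> List[int]:
        # Global tokens prefix the semantic stream, so one autoregressive
        # pass conditions content generation on speaker identity.
        return self.global_ + self.semantic

tokens = BiCodecTokens(semantic=[512, 87, 993, 15], global_=[3, 41])
stream = tokens.to_single_stream()
```

Because the global block has a fixed length, the LLM always knows where speaker conditioning ends and content begins, with no interleaving or per-frame codebook bookkeeping.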

Evaluation Highlights

  • Spark-TTS achieves state-of-the-art zero-shot voice cloning performance, surpassing existing foundation TTS models on standard benchmarks
  • The system generates highly customizable synthetic voices with fine-grained attribute control, surpassing the expressiveness limits of reference-audio-based synthesis

Breakthrough Assessment

7/10. Spark-TTS represents a significant architectural advance by replacing complex multi-codebook prediction with a clean two-token-type disentanglement, and the CoT-driven fine-grained voice control is a meaningful capability expansion. However, it builds on established LLM-TTS paradigms rather than redefining the field entirely.

Methodology

  1. Train BiCodec to decompose speech into two complementary token streams: variable-length semantic tokens capturing linguistic content at low bitrate, and fixed-length global tokens encoding speaker identity and attributes
  2. Fine-tune Qwen2.5 LLM on the VoxBox 100K-hour annotated dataset to generate speech tokens autoregressively, using a chain-of-thought prompting strategy that first reasons about target voice attributes before generating tokens
  3. Decode the generated single-stream token sequence through BiCodec's vocoder to synthesize high-quality speech, with the decoupled representation enabling both zero-shot cloning (from reference audio global tokens) and attribute-controlled synthesis (from text-specified parameters)
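The pipeline above supports two entry points into the same decoder: global tokens either come from a reference clip (cloning) or are predicted by the LLM from text-specified attributes (control). The sketch below uses stub functions with hypothetical names to make that branching concrete; it is not the released API:

```python
def encode_global(audio):                 # stub for BiCodec's global encoder
    return [f"g{b}" for b in audio[:2]]

def llm_predict_global(attributes):       # stub for the CoT attribute step
    return [f"g:{k}={v}" for k, v in sorted(attributes.items())]

def llm_predict_semantic(text, global_tokens):  # stub for the AR LLM
    return [f"s:{w}" for w in text.split()]

def bicodec_decode(semantic_tokens, global_tokens):  # stub for the vocoder
    return {"speaker": global_tokens, "content": semantic_tokens}

def synthesize(text, reference_audio=None, attributes=None):
    if reference_audio is not None:
        # Zero-shot cloning: fixed-length global tokens come from the reference.
        global_tokens = encode_global(reference_audio)
    else:
        # Attribute control: the LLM first predicts global tokens from
        # text-specified attributes (the CoT step), then the content.
        global_tokens = llm_predict_global(attributes)
    semantic_tokens = llm_predict_semantic(text, global_tokens)
    return bicodec_decode(semantic_tokens, global_tokens)

result = synthesize("hello world", attributes={"gender": "female"})
```

The key property is that both modes converge on the same `(semantic, global)` interface before decoding, so the vocoder never needs to know which control path produced the tokens.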

System Components

BiCodec

Single-stream speech codec that disentangles speech into low-bitrate semantic tokens (linguistic/content information) and fixed-length global tokens (speaker identity, style attributes), enabling efficient single-pass LLM prediction

Qwen2.5 LLM Backbone

Pre-trained large language model adapted for autoregressive speech token generation, leveraging its language understanding for chain-of-thought reasoning over voice attributes

Chain-of-Thought (CoT) Generation

Generation strategy where the model first explicitly reasons about desired voice attributes (gender, pitch, speaking rate, style) before producing speech tokens, enabling fine-grained controllable synthesis
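One way to picture this is a prompt in which attribute specification precedes the token stream, so the model commits to voice properties before emitting speech tokens. The tag names and serialization below are assumptions for illustration, not the paper's exact format:

```python
def build_cot_prompt(text, **attrs):
    """Illustrative CoT prompt layout: attributes first, then speech tokens.
    Tag names are assumed, not the paper's exact serialization."""
    parts = [f"<text>{text}</text>"]
    for name, value in sorted(attrs.items()):
        if value is not None:
            parts.append(f"<{name}>{value}</{name}>")
    parts.append("<global_tokens>")   # the model continues from here
    return "".join(parts)

prompt = build_cot_prompt("Good morning.", gender="female", pitch="high")
```

At inference, the same layout allows the user to fill in as many or as few attributes as desired; unspecified ones are left for the model to choose, which is what makes both coarse and fine control possible in one framework.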

VoxBox Dataset

100,000-hour curated speech corpus with comprehensive per-sample attribute annotations (gender, speaking style, pitch, rate, etc.) designed specifically for training controllable TTS models
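A per-sample annotation of this kind might look like the record below; the field names and the bucketing thresholds are illustrative assumptions, not VoxBox's actual schema, but they show how fine-grained measurements (e.g. pitch in Hz) can back coarse labels used for CoT control:

```python
# Illustrative VoxBox-style annotation record (field names are assumptions).
sample = {
    "audio": "clips/000123.wav",
    "text": "The quick brown fox jumps over the lazy dog.",
    "gender": "female",
    "pitch": "high",         # coarse label derived from the measurement below
    "pitch_hz": 212.4,       # fine-grained value
    "speaking_rate": 4.8,    # e.g. syllables per second
    "style": "narration",
}

def coarse_bucket(pitch_hz, boundaries=(150.0, 250.0)):
    """Map a fine-grained pitch measurement to a coarse control label.
    Boundary values are arbitrary for illustration."""
    if pitch_hz < boundaries[0]:
        return "low"
    return "moderate" if pitch_hz < boundaries[1] else "high"
```

Keeping both the raw measurement and its coarse bucket in each record is what lets a single dataset train both coarse-grained (label-level) and fine-grained (value-level) controllability.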

Results

| Metric/Benchmark | Baseline (Best Prior) | Spark-TTS | Delta |
| --- | --- | --- | --- |
| Zero-shot voice cloning quality | State-of-the-art prior models | New SOTA | Surpasses existing foundation models |
| Voice attribute controllability | Limited to reference audio | Fine-grained, text-specified control | Enables precise pitch/rate beyond reference cloning |
| Architecture complexity | Multi-stage / multi-codebook | Single-stream, single-stage LLM | Reduced complexity, easier integration |
| Training data | Typically <50K hours annotated | 100K hours (VoxBox) | Larger scale, with attribute annotations |

Key Takeaways

  • Decoupling speech tokens into semantic (content) and global (speaker) streams is a powerful design pattern that simplifies LLM integration for TTS — practitioners building voice AI systems should consider this two-token-type abstraction over residual codebook approaches
  • Chain-of-thought prompting is not just for reasoning tasks — applying CoT to intermediate attribute specification before token generation is an effective way to inject fine-grained control into generative audio models without architectural overhauls
  • The release of BiCodec, pre-trained models, and the VoxBox 100K-hour annotated dataset makes Spark-TTS a strong foundation for downstream controllable TTS research and production systems, particularly for zero-shot voice customization applications

Abstract

Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis. However, existing foundation models rely on multi-stage processing or complex architectures for predicting multiple codebooks, limiting efficiency and integration flexibility. To overcome these challenges, we introduce Spark-TTS, a novel system powered by BiCodec, a single-stream speech codec that decomposes speech into two complementary token types: low-bitrate semantic tokens for linguistic content and fixed-length global tokens for speaker attributes. This disentangled representation, combined with the Qwen2.5 LLM and a chain-of-thought (CoT) generation approach, enables both coarse-grained control (e.g., gender, speaking style) and fine-grained adjustments (e.g., precise pitch values, speaking rate). To facilitate research in controllable TTS, we introduce VoxBox, a meticulously curated 100,000-hour dataset with comprehensive attribute annotations. Extensive experiments demonstrate that Spark-TTS not only achieves state-of-the-art zero-shot voice cloning but also generates highly customizable voices that surpass the limitations of reference-based synthesis. Source code, pre-trained models, and audio samples are available at https://github.com/SparkAudio/Spark-TTS.

Generated on 2026-03-02 using Claude