VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation

Yuhao Wang, Heyang Liu, Ziyang Cheng, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang
arXiv.org | 2025
VocalNet introduces multi-token prediction (MTP) to speech LLMs for the first time, enabling simultaneous improvements in generation speed and output quality for real-time voice interaction at 1B and 8B parameter scales.

Problem Statement

Speech LLMs have traditionally relied on next-token prediction (NTP), which creates a speed-quality tradeoff bottleneck limiting real-time voice interaction. Existing open-source speech LLMs lag behind proprietary Omni LLMs in performance, and inference latency remains a critical barrier for deployment. There is also a lack of reproducible, scalable training frameworks accessible to the research community.

Key Novelty

  • First application of multi-token prediction (MTP) to speech LLMs, breaking the assumption that NTP is the only viable autoregressive paradigm for speech generation
  • A scalable, model-agnostic training framework for real-time speech LLMs supporting both 1B and 8B parameter scales
  • Fully open-source release including model weights, inference code, training data, and framework implementation to foster reproducibility

Evaluation Highlights

  • VocalNet performs on par with mainstream proprietary Omni LLMs despite using limited training data, demonstrating strong data efficiency
  • VocalNet significantly surpasses existing open-source speech LLMs on generation quality benchmarks while also achieving lower latency than standard NTP-based approaches

Breakthrough Assessment

7/10. Applying MTP to speech LLMs is a meaningful architectural innovation that simultaneously improves speed and quality, two traditionally competing objectives, and the fully open-source release amplifies community impact. However, the core MTP concept is adapted from text LLM research rather than invented from scratch.

Methodology

  1. Analyze the effect of multi-token prediction on speech generation characteristics, identifying how predicting multiple tokens simultaneously benefits both speed and acoustic coherence in speech sequences
  2. Design a straightforward MTP implementation integrated into a model-agnostic training framework, enabling the approach to be applied across different base LLM architectures at 1B and 8B scales
  3. Train and evaluate VocalNet models against open-source speech LLMs and proprietary Omni LLMs, validating latency reduction and generation quality improvements across benchmarks
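The latency benefit in steps 1 and 3 comes from simple arithmetic: predicting k tokens per forward pass divides the number of sequential decoding steps by roughly k. A minimal sketch of that accounting (the token rate and head count below are illustrative assumptions, not figures from the paper):

```python
import math

def autoregressive_steps(num_tokens: int, tokens_per_step: int = 1) -> int:
    """Sequential forward passes needed to emit `num_tokens` speech tokens
    when each pass predicts `tokens_per_step` tokens at once."""
    return math.ceil(num_tokens / tokens_per_step)

# Hypothetical example: a 10-second utterance at 50 speech tokens/sec.
ntp_steps = autoregressive_steps(500, tokens_per_step=1)  # NTP baseline
mtp_steps = autoregressive_steps(500, tokens_per_step=5)  # MTP with 5 heads

print(ntp_steps, mtp_steps)  # 500 vs 100 sequential passes
```

Because per-step compute grows far more slowly than the step count shrinks, wall-clock latency drops roughly in proportion to the number of tokens predicted per step.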

System Components

Multi-Token Prediction (MTP) Head

Replaces standard next-token prediction by predicting multiple future speech tokens simultaneously, reducing autoregressive steps and improving generation speed and quality
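A toy sketch of the idea, in pure Python: k heads share one final hidden state, each scoring a small speech-token vocabulary, so one forward pass yields k tokens. The class name, shapes, and weights here are illustrative assumptions; the paper's actual head architecture may differ.

```python
def dot(v, w):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(v, w))

class MTPHead:
    def __init__(self, head_weights):
        # head_weights[i][t] scores token t for head i (position t+i+1)
        self.head_weights = head_weights

    def predict(self, hidden_state):
        """Return one token id per head from a single hidden state."""
        tokens = []
        for weights in self.head_weights:
            scores = [dot(hidden_state, w) for w in weights]
            tokens.append(scores.index(max(scores)))  # greedy argmax
        return tokens

# Two heads, a 3-token vocabulary, a 2-d hidden state.
heads = [
    [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],  # head for position t+1
    [[0.0, 1.0], [1.0, 0.0], [0.2, 0.2]],  # head for position t+2
]
print(MTPHead(heads).predict([2.0, 1.0]))  # -> [0, 1]
```

In a real speech LLM these heads would be learned projections over thousands of codec tokens, but the control flow, one hidden state in, k tokens out, is the same.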

Scalable Model-Agnostic Training Framework

A reusable training pipeline that supports different base LLM architectures and scales (1B/8B), enabling flexible deployment and community adoption

VocalNet-1B and VocalNet-8B

Two trained speech LLM variants optimized for real-time voice interaction, offering a tradeoff between computational cost and generation quality

Open-Source Release Package

Publicly available model weights, inference code, training datasets, and framework code hosted on GitHub to enable reproducibility and community research

Results

Metric/Benchmark             Open-Source Speech LLMs             VocalNet               Delta
Speech Generation Quality    Lower (existing SOTA open-source)   Significantly higher   Substantial improvement
Inference Latency            Higher (NTP baseline)               Lower (MTP-enabled)    Reduced latency
vs. Proprietary Omni LLMs    Below par                           On par                 Competitive parity with limited data
Parameter Scale              Varies                              1B and 8B              Scalable coverage

Key Takeaways

  • MTP is a practical drop-in paradigm shift for speech LLMs that practitioners can adopt to reduce inference latency without sacrificing—and potentially improving—generation quality
  • The model-agnostic training framework means teams can apply VocalNet's MTP approach to their own base LLMs, making this broadly applicable beyond the specific checkpoints released
  • The full open-source release (weights, data, code) makes VocalNet an immediately usable baseline for real-time voice interaction research, lowering the barrier to reproducing and extending state-of-the-art speech LLM results

Abstract

Speech large language models (LLMs) have emerged as a prominent research focus in speech processing. We introduce VocalNet-1B and VocalNet-8B, a series of high-performance, low-latency speech LLMs enabled by a scalable and model-agnostic training framework designed for real-time voice interaction. Central to our contribution is the first application of multi-token prediction (MTP) to speech LLMs. This approach represents a paradigm shift from standard next-token prediction (NTP), offering simultaneous improvements in generation speed and quality. Informed by analysis of MTP's effect on speech generation and experimental comparisons, we designed a straightforward and highly effective MTP implementation. Experiments demonstrate that VocalNet performs on par with mainstream Omni LLMs even with limited training data, and significantly surpasses existing open-source speech LLMs. To foster reproducibility and community advancement, all model weights, inference code, training data, and framework implementations have been made publicly available at https://github.com/SJTU-OmniAgent/VocalNet

Generated on 2026-03-02 using Claude