VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation
Problem Statement
Speech LLMs have traditionally relied on next-token prediction (NTP), which forces a trade-off between generation speed and quality and bottlenecks real-time voice interaction. Existing open-source speech LLMs also lag behind proprietary Omni LLMs in performance, and inference latency remains a critical barrier to deployment. Finally, the research community lacks reproducible, scalable training frameworks for speech LLMs.
Key Novelty
- First application of multi-token prediction (MTP) to speech LLMs, breaking the assumption that NTP is the only viable autoregressive paradigm for speech generation
- A scalable, model-agnostic training framework for real-time speech LLMs supporting both 1B and 8B parameter scales
- Fully open-source release including model weights, inference code, training data, and framework implementation to foster reproducibility
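As a rough illustration of why MTP reduces latency (a minimal sketch with hypothetical numbers, not figures from the paper): if the model emits k speech tokens per autoregressive step instead of one, the number of decode steps shrinks by roughly a factor of k.

```python
import math

def decode_steps(n_tokens: int, tokens_per_step: int = 1) -> int:
    """Number of autoregressive forward passes needed to emit n_tokens."""
    return math.ceil(n_tokens / tokens_per_step)

# Hypothetical example: a 120-token speech segment.
ntp_steps = decode_steps(120, tokens_per_step=1)  # standard NTP: one token per pass
mtp_steps = decode_steps(120, tokens_per_step=4)  # MTP emitting 4 tokens per pass

print(ntp_steps, mtp_steps)  # 120 30
```

The wall-clock speedup is somewhat less than k in practice, since each MTP forward pass does slightly more work than an NTP pass.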
Evaluation Highlights
- VocalNet performs on par with mainstream proprietary Omni LLMs despite using limited training data, demonstrating strong data efficiency
- VocalNet significantly surpasses existing open-source speech LLMs on generation quality benchmarks while also achieving lower latency than standard NTP-based approaches
Methodology
- Analyze the effect of multi-token prediction on speech generation characteristics, identifying how predicting multiple tokens simultaneously benefits both speed and acoustic coherence in speech sequences
- Design a straightforward MTP implementation integrated into a model-agnostic training framework, enabling the approach to be applied across different base LLM architectures at 1B and 8B scales
- Train and evaluate VocalNet models against open-source speech LLMs and proprietary Omni LLMs, validating latency reduction and generation quality improvements across benchmarks
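The decoding side of the approach can be sketched as follows. This is an illustrative toy with hypothetical sizes and a made-up state update, not VocalNet's actual architecture: a shared hidden state feeds k prediction heads, each producing logits for one of the next k speech tokens, so a single forward pass yields k tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, VOCAB, K = 16, 32, 4  # hypothetical sizes; K = tokens predicted per step

# One linear head per future-token offset (toy stand-in for real MTP heads).
heads = [rng.standard_normal((HIDDEN, VOCAB)) for _ in range(K)]
embed = rng.standard_normal((VOCAB, HIDDEN))  # toy token-embedding table

def mtp_step(hidden: np.ndarray) -> list[int]:
    """One autoregressive step: the K heads each predict one future token."""
    return [int(np.argmax(hidden @ W)) for W in heads]

def generate(n_tokens: int) -> list[int]:
    hidden = rng.standard_normal(HIDDEN)
    out: list[int] = []
    while len(out) < n_tokens:
        tokens = mtp_step(hidden)  # K tokens from a single forward pass
        out.extend(tokens)
        # Toy state update: fold the last emitted token back into the state.
        hidden = np.tanh(hidden + embed[tokens[-1]])
    return out[:n_tokens]

speech_tokens = generate(12)  # 12 tokens in ceil(12/4) = 3 forward steps
```

The point of the sketch is the loop structure: the number of forward passes (the latency driver) scales with n_tokens / K rather than n_tokens.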
System Components
- Multi-token prediction (MTP) module: Replaces standard next-token prediction by predicting multiple future speech tokens simultaneously, reducing autoregressive steps and improving generation speed and quality
- Model-agnostic training framework: A reusable training pipeline that supports different base LLM architectures and scales (1B/8B), enabling flexible deployment and community adoption
- VocalNet-1B and VocalNet-8B: Two trained speech LLM variants optimized for real-time voice interaction, offering a tradeoff between computational cost and generation quality
- Open-source release: Publicly available model weights, inference code, training datasets, and framework code hosted on GitHub to enable reproducibility and community research
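MTP heads like those described above are typically trained with one cross-entropy term per future-token offset. A minimal numpy sketch of that objective, with hypothetical shapes and not VocalNet's exact loss:

```python
import numpy as np

def mtp_loss(logits: np.ndarray, targets: np.ndarray) -> float:
    """Average cross-entropy over K future-token predictions.

    logits:  (K, vocab) raw scores, one row per predicted offset
    targets: (K,) ground-truth token ids for the next K positions
    """
    # Numerically stable log-softmax, computed per head (per row).
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Negative log-likelihood of each head's target, averaged over heads.
    return float(-log_probs[np.arange(len(targets)), targets].mean())

# Hypothetical example: 3 heads over a 5-token vocabulary, all heads confident
# and correct, so the loss is small.
logits = np.array([[4.0, 0.0, 0.0, 0.0, 0.0],
                   [0.0, 4.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 4.0, 0.0, 0.0]])
loss = mtp_loss(logits, np.array([0, 1, 2]))
```

Averaging rather than summing over the K offsets keeps the loss scale comparable to an NTP baseline regardless of how many heads are used.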
Results
| Metric/Benchmark | Open-Source Speech LLMs | VocalNet | Delta |
|---|---|---|---|
| Speech Generation Quality | Lower (existing SOTA open-source) | Significantly higher | Substantial improvement |
| Inference Latency | Higher (NTP baseline) | Lower (MTP-enabled) | Reduced latency |
| vs. Proprietary Omni LLMs | Below par | On par | Competitive parity with limited data |
| Parameter Scale | Varies | 1B and 8B | Scalable coverage |
Key Takeaways
- MTP is a practical drop-in paradigm shift for speech LLMs: practitioners can adopt it to reduce inference latency without sacrificing generation quality, and in practice it may even improve quality
- The model-agnostic training framework means teams can apply VocalNet's MTP approach to their own base LLMs, making this broadly applicable beyond the specific checkpoints released
- The full open-source release (weights, data, code) makes VocalNet an immediately usable baseline for real-time voice interaction research, lowering the barrier to reproducing and extending state-of-the-art speech LLM results
Abstract
Speech large language models (LLMs) have emerged as a prominent research focus in speech processing. We introduce VocalNet-1B and VocalNet-8B, a series of high-performance, low-latency speech LLMs enabled by a scalable and model-agnostic training framework designed for real-time voice interaction. Central to our contribution is the first application of multi-token prediction (MTP) to speech LLMs. This approach represents a paradigm shift from standard next-token prediction (NTP), offering simultaneous improvements in generation speed and quality. Informed by analysis of MTP's effect on speech generation and experimental comparisons, we designed a straightforward and highly effective MTP implementation. Experiments demonstrate that VocalNet performs on par with mainstream Omni LLMs even with limited training data, and significantly surpasses existing open-source speech LLMs. To foster reproducibility and community advancement, all model weights, inference code, training data, and framework implementations have been made publicly available at https://github.com/SJTU-OmniAgent/VocalNet