VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation
Problem Statement
Speech LLMs have traditionally relied on next-token prediction (NTP), which forces a trade-off between generation speed and quality and bottlenecks real-time voice interaction. Existing open-source speech LLMs also lag behind proprietary Omni LLMs in performance, and inference latency remains a critical barrier to deployment. Finally, the research community lacks reproducible, scalable training frameworks for speech LLMs.
Key Novelty
- First application of multi-token prediction (MTP) to speech LLMs, breaking the assumption that NTP is the only viable autoregressive paradigm for speech generation
- A scalable, model-agnostic training framework for real-time speech LLMs supporting both 1B and 8B parameter scales
- Fully open-source release including model weights, inference code, training data, and framework implementation to foster reproducibility
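As a rough illustration of why MTP reduces latency (a minimal sketch with hypothetical numbers, not figures from the paper): if the model emits k speech tokens per autoregressive step instead of one, the number of decode steps shrinks by roughly a factor of k.

```python
import math

def decode_steps(n_tokens: int, tokens_per_step: int = 1) -> int:
    """Number of autoregressive forward passes needed to emit n_tokens."""
    return math.ceil(n_tokens / tokens_per_step)

# Hypothetical example: a 120-token speech segment.
ntp_steps = decode_steps(120, tokens_per_step=1)  # standard NTP: one token per pass
mtp_steps = decode_steps(120, tokens_per_step=4)  # MTP emitting 4 tokens per pass

print(ntp_steps, mtp_steps)  # 120 30
```

The wall-clock speedup is somewhat less than k in practice, since each MTP forward pass does slightly more work than an NTP pass.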
Evaluation Highlights
- VocalNet performs on par with mainstream proprietary Omni LLMs despite using limited training data, demonstrating strong data efficiency
- VocalNet significantly surpasses existing open-source speech LLMs on generation quality benchmarks while also achieving lower latency than standard NTP-based approaches
Methodology
- Analyze the effect of multi-token prediction on speech generation characteristics, identifying how predicting multiple tokens simultaneously benefits both speed and acoustic coherence in speech sequences
- Design a straightforward MTP implementation integrated into a model-agnostic training framework, enabling the approach to be applied across different base LLM architectures at 1B and 8B scales
- Train and evaluate VocalNet models against open-source speech LLMs and proprietary Omni LLMs, validating latency reduction and generation quality improvements across benchmarks
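The decoding side of the approach can be sketched as follows. This is an illustrative toy with hypothetical sizes and a made-up state update, not VocalNet's actual architecture: a shared hidden state feeds k prediction heads, each producing logits for one of the next k speech tokens, so a single forward pass yields k tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, VOCAB, K = 16, 32, 4  # hypothetical sizes; K = tokens predicted per step

# One linear head per future-token offset (toy stand-in for real MTP heads).
heads = [rng.standard_normal((HIDDEN, VOCAB)) for _ in range(K)]
embed = rng.standard_normal((VOCAB, HIDDEN))  # toy token-embedding table

def mtp_step(hidden: np.ndarray) -> list[int]:
    """One autoregressive step: the K heads each predict one future token."""
    return [int(np.argmax(hidden @ W)) for W in heads]

def generate(n_tokens: int) -> list[int]:
    hidden = rng.standard_normal(HIDDEN)
    out: list[int] = []
    while len(out) < n_tokens:
        tokens = mtp_step(hidden)  # K tokens from a single forward pass
        out.extend(tokens)
        # Toy state update: fold the last emitted token back into the state.
        hidden = np.tanh(hidden + embed[tokens[-1]])
    return out[:n_tokens]

speech_tokens = generate(12)  # 12 tokens in ceil(12/4) = 3 forward steps
```

The point of the sketch is the loop structure: the number of forward passes (the latency driver) scales with n_tokens / K rather than n_tokens.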
System Components
- Multi-token prediction (MTP) module: Replaces standard next-token prediction by predicting multiple future speech tokens simultaneously, reducing autoregressive steps and improving generation speed and quality
- Model-agnostic training framework: A reusable training pipeline that supports different base LLM architectures and scales (1B/8B), enabling flexible deployment and community adoption
- VocalNet-1B and VocalNet-8B: Two trained speech LLM variants optimized for real-time voice interaction, offering a tradeoff between computational cost and generation quality
- Open-source release: Publicly available model weights, inference code, training datasets, and framework code hosted on GitHub to enable reproducibility and community research
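MTP heads like those described above are typically trained with one cross-entropy term per future-token offset. A minimal numpy sketch of that objective, with hypothetical shapes and not VocalNet's exact loss:

```python
import numpy as np

def mtp_loss(logits: np.ndarray, targets: np.ndarray) -> float:
    """Average cross-entropy over K future-token predictions.

    logits:  (K, vocab) raw scores, one row per predicted offset
    targets: (K,) ground-truth token ids for the next K positions
    """
    # Numerically stable log-softmax, computed per head (per row).
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Negative log-likelihood of each head's target, averaged over heads.
    return float(-log_probs[np.arange(len(targets)), targets].mean())

# Hypothetical example: 3 heads over a 5-token vocabulary, all heads confident
# and correct, so the loss is small.
logits = np.array([[4.0, 0.0, 0.0, 0.0, 0.0],
                   [0.0, 4.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 4.0, 0.0, 0.0]])
loss = mtp_loss(logits, np.array([0, 1, 2]))
```

Averaging rather than summing over the K offsets keeps the loss scale comparable to an NTP baseline regardless of how many heads are used.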
Results
| Metric/Benchmark | Open-Source Speech LLMs | VocalNet | Delta |
|---|---|---|---|
| Speech Generation Quality | Lower (existing SOTA open-source) | Significantly higher | Substantial improvement |
| Inference Latency | Higher (NTP baseline) | Lower (MTP-enabled) | Reduced latency |
| vs. Proprietary Omni LLMs | Below par | On par | Competitive parity with limited data |
| Parameter Scale | Varies | 1B and 8B | Scalable coverage |
Key Takeaways
- MTP is a practical drop-in paradigm shift for speech LLMs: practitioners can adopt it to reduce inference latency without sacrificing generation quality, and in practice it may even improve quality
- The model-agnostic training framework means teams can apply VocalNet's MTP approach to their own base LLMs, making this broadly applicable beyond the specific checkpoints released
- The full open-source release (weights, data, code) makes VocalNet an immediately usable baseline for real-time voice interaction research, lowering the barrier to reproducing and extending state-of-the-art speech LLM results
Abstract
Speech large language models (LLMs) have emerged as a prominent research focus in speech processing. We introduce VocalNet-1B and VocalNet-8B, a series of high-performance, low-latency speech LLMs enabled by a scalable and model-agnostic training framework designed for real-time voice interaction. Central to our contribution is the first application of multi-token prediction (MTP) to speech LLMs. This approach represents a paradigm shift from standard next-token prediction (NTP), offering simultaneous improvements in generation speed and quality. Informed by analysis of MTP's effect on speech generation and experimental comparisons, we designed a straightforward and highly effective MTP implementation. Experiments demonstrate that VocalNet performs on par with mainstream Omni LLMs even with limited training data, and significantly surpasses existing open-source speech LLMs. To foster reproducibility and community advancement, all model weights, inference code, training data, and framework implementations have been made publicly available at https://github.com/SJTU-OmniAgent/VocalNet