Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model
Problem Statement
Current speech language models are largely constrained to turn-based exchanges, making them unable to handle real-world conversational dynamics like user interruptions (barge-in) or overlapping speech. Existing duplex models require expensive speech pretraining and operate at high bitrates, limiting accessibility and scalability. This work addresses these bottlenecks by enabling duplex S2S modeling directly from any pretrained LLM without speech-specific pretraining.
Key Novelty
- First duplex S2S model that skips speech pretraining entirely by leveraging a pretrained streaming encoder for user input, enabling direct fine-tuning from any LLM
- Channel fusion architecture that directly and simultaneously models both user and agent audio streams, supporting real-time barge-in and continuous interaction
- Separate architectures for agent and user modeling enabling codec fine-tuning for improved agent voice quality and halving the bitrate to 0.6 kbps compared to prior work
Evaluation Highlights
- Outperforms previous duplex models on reasoning, turn-taking, and barge-in benchmarks in the head-to-head comparisons reported in the Interspeech 2025 paper
- Achieves 0.6 kbps agent codec bitrate—roughly half of prior duplex S2S systems—while maintaining or improving quality
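For intuition on the bitrate figure, the arithmetic below shows one way a 0.6 kbps agent stream can arise; the codebook size and token rate are illustrative assumptions, not the paper's reported configuration.

```python
# Back-of-envelope bitrate check (assumed codec parameters): a single
# codebook of 4096 entries costs log2(4096) = 12 bits per token, so a
# 50 tokens/s agent stream uses 50 * 12 = 600 bits/s = 0.6 kbps.
import math

codebook_size = 4096                                   # assumption
tokens_per_second = 50                                 # assumption
bits_per_token = math.log2(codebook_size)              # 12.0
bitrate_kbps = tokens_per_second * bits_per_token / 1000
print(f"{bitrate_kbps:.1f} kbps")                      # 0.6 kbps
```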
Methodology
- Step 1 — User Input Encoding: A pretrained streaming (causal) speech encoder continuously processes user audio in real time, producing continuous representations without requiring speech-specific LLM pretraining.
- Step 2 — Channel Fusion & Duplex Modeling: User encoder outputs and agent codec token streams are fused via a channel fusion mechanism, allowing the LLM backbone to jointly model both simultaneous streams and make turn-taking or barge-in decisions (see the sketch after this list).
- Step 3 — Separate Agent Output Decoding: A dedicated, fine-tuned codec decoder synthesizes agent speech from tokens generated at a reduced 0.6 kbps bitrate, decoupling high-quality agent voice synthesis from user modeling.
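A minimal PyTorch sketch of these three steps as one streaming update, assuming concatenate-and-project channel fusion and placeholder module sizes; the class, fusion operator, and dimensions here are hypothetical stand-ins, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DuplexStep(nn.Module):
    """One duplex modeling step: fuse the user encoder features with the
    agent's previous codec tokens, run the LLM backbone over the fused
    stream, and predict the next agent codec token at each frame."""

    def __init__(self, d_model=1024, codebook_size=4096):
        super().__init__()
        self.agent_embed = nn.Embedding(codebook_size, d_model)
        # Channel fusion (assumed form): concatenate the two channels at
        # each time step and project back to the LLM width.
        self.fuse = nn.Linear(2 * d_model, d_model)
        self.backbone = nn.TransformerEncoderLayer(  # stand-in for the LLM
            d_model, nhead=8, batch_first=True)
        self.agent_head = nn.Linear(d_model, codebook_size)

    def forward(self, user_feats, agent_tokens):
        # user_feats: (B, T, d_model) from the streaming user encoder
        # agent_tokens: (B, T) previously generated agent codec tokens
        fused = self.fuse(torch.cat(
            [user_feats, self.agent_embed(agent_tokens)], dim=-1))
        hidden = self.backbone(fused)   # a real system would apply a causal mask
        return self.agent_head(hidden)  # logits over the agent codebook
```

The key property this illustrates is that the user and agent channels share one time axis, so the backbone can decide at every frame whether to keep speaking or yield to a barge-in.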
System Components
- Streaming User Encoder: A causal, real-time speech encoder that processes continuous user audio and produces representations directly usable by the LLM, bypassing the need for speech pretraining (see the causal-convolution sketch after this list)
- Channel Fusion: Merges the continuous user audio stream with the agent codec output stream so the model can jointly reason about both channels simultaneously, enabling true duplex behavior
- Agent Codec Decoder: A dedicated codec-based speech synthesis module fine-tuned independently for the agent, improving voice quality and reducing the bitrate to 0.6 kbps
- LLM Backbone: A pretrained large language model that serves as the reasoning core, adapted for duplex S2S without requiring speech-domain pretraining
- Barge-In Handling: An emergent capability of the duplex architecture that allows the model to detect and respond to user interruptions in real time
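The "causal, real-time" property of the user encoder can be illustrated with left-padded convolutions, as sketched below; this is a generic streaming-encoder building block under assumed dimensions, not the specific pretrained encoder the paper uses.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Conv1d):
    """Conv1d that pads only on the left, so each output frame depends
    solely on current and past input; this lets the encoder run
    incrementally on live audio with no lookahead latency."""

    def forward(self, x):
        left_pad = (self.kernel_size[0] - 1) * self.dilation[0]
        return super().forward(nn.functional.pad(x, (left_pad, 0)))

# Usage (assumed feature sizes): 80-dim mel frames in, 512-dim streaming
# features out, with output length equal to input length at every step.
enc = CausalConv1d(80, 512, kernel_size=5)
feats = enc(torch.randn(1, 80, 100))   # -> torch.Size([1, 512, 100])
```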
Results
| Metric/Benchmark | Previous Duplex Models | This Paper | Delta |
|---|---|---|---|
| Reasoning Ability | Lower | Higher (best among duplex models) | Improved |
| Turn-Taking Accuracy | Lower | Higher (best among duplex models) | Improved |
| Barge-In Handling | Lower | Higher (best among duplex models) | Improved |
| Agent Codec Bitrate | ~1.2 kbps | 0.6 kbps | ~50% reduction |
| Speech Pretraining Required | Yes | No | Eliminated entirely |
Key Takeaways
- ML practitioners can now build duplex conversational speech models starting from any pretrained LLM without collecting large speech pretraining datasets, dramatically reducing development cost and complexity.
- The 0.6 kbps codec bitrate and streaming encoder design make this architecture suitable for low-bandwidth, real-time deployment scenarios such as mobile or edge devices.
- The fully open-sourced training and inference codebase provides a reproducible baseline for future duplex S2S research, lowering the barrier for the broader NLP/speech community to iterate on real-time spoken dialogue systems.
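For a sense of how such a model is driven at inference time, the loop below sketches a real-time duplex session; every name here (`mic`, `encoder.encode_step`, `model.step`, `codec`) is a placeholder for illustration, not the API of the released codebase.

```python
# Hypothetical real-time duplex loop (placeholder API, not the released
# code): each audio hop, encode the newest user chunk, advance the LLM
# one step over both channels, and play back any non-silent agent token.
def duplex_loop(mic, speaker, encoder, model, codec, hop_ms=80):
    state = model.initial_state()
    while True:
        chunk = mic.read(hop_ms)                 # newest user audio
        user_feat = encoder.encode_step(chunk)   # causal: no lookahead
        agent_token, state = model.step(user_feat, state)
        # Turn-taking and barge-in fall out of the token stream: the
        # model emits silence on the agent channel when the user talks.
        if agent_token != codec.silence_token:
            speaker.play(codec.decode_step(agent_token))
```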
Abstract
Spoken dialogue is an intuitive form of human-computer interaction, yet current speech language models often remain constrained to turn-based exchanges, lacking real-time adaptability such as user barge-in. We propose a novel duplex speech-to-speech (S2S) architecture featuring continuous user inputs and codec agent outputs with channel fusion that directly models simultaneous user and agent streams. Using a pretrained streaming encoder for user input enables the first duplex S2S model without requiring speech pretraining. Separate architectures for agent and user modeling facilitate codec fine-tuning for better agent voices and halve the bitrate (0.6 kbps) compared to previous work. Experimental results show that the proposed model outperforms previous duplex models in reasoning, turn-taking, and barge-in abilities. The model requires significantly less speech data, as speech pretraining is skipped, which markedly simplifies the process of building a duplex S2S model from any LLM. Finally, it is the first openly available duplex S2S model with training and inference code to foster reproducibility.