Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model
Problem Statement
Current speech language models are largely constrained to turn-based exchanges, making them unable to handle real-world conversational dynamics like user interruptions (barge-in) or overlapping speech. Existing duplex models require expensive speech pretraining and operate at high bitrates, limiting accessibility and scalability. This work addresses these bottlenecks by enabling duplex S2S modeling directly from any pretrained LLM without speech-specific pretraining.
Key Novelty
- First duplex S2S model that skips speech pretraining entirely by leveraging a pretrained streaming encoder for user input, enabling direct fine-tuning from any LLM
- Channel fusion architecture that directly and simultaneously models both user and agent audio streams, supporting real-time barge-in and continuous interaction
- Separate architectures for agent and user modeling enabling codec fine-tuning for improved agent voice quality and halving the bitrate to 0.6 kbps compared to prior work
Evaluation Highlights
- Outperforms previous duplex models on reasoning, turn-taking, and barge-in benchmarks in the head-to-head comparisons reported in the Interspeech 2025 paper
- Achieves 0.6 kbps agent codec bitrate—roughly half of prior duplex S2S systems—while maintaining or improving quality
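For intuition on the bitrate figure, the arithmetic below shows one way a 0.6 kbps agent stream can arise; the codebook size and token rate are illustrative assumptions, not the paper's reported configuration.

```python
# Back-of-envelope bitrate check (assumed codec parameters): a single
# codebook of 4096 entries costs log2(4096) = 12 bits per token, so a
# 50 tokens/s agent stream uses 50 * 12 = 600 bits/s = 0.6 kbps.
import math

codebook_size = 4096                                   # assumption
tokens_per_second = 50                                 # assumption
bits_per_token = math.log2(codebook_size)              # 12.0
bitrate_kbps = tokens_per_second * bits_per_token / 1000
print(f"{bitrate_kbps:.1f} kbps")                      # 0.6 kbps
```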
Methodology
- Step 1 — User Input Encoding: A pretrained streaming (causal) speech encoder continuously processes user audio in real time, producing continuous representations without requiring speech-specific LLM pretraining.
- Step 2 — Channel Fusion & Duplex Modeling: User encoder outputs and agent codec token streams are fused via a channel fusion mechanism, allowing the LLM backbone to jointly model both simultaneous streams and make turn-taking or barge-in decisions (see the sketch after this list).
- Step 3 — Separate Agent Output Decoding: A dedicated, fine-tuned codec decoder synthesizes agent speech from tokens generated at a reduced 0.6 kbps bitrate, decoupling high-quality agent voice synthesis from user modeling.
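A minimal PyTorch sketch of these three steps as one streaming update, assuming concatenate-and-project channel fusion and placeholder module sizes; the class, fusion operator, and dimensions here are hypothetical stand-ins, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DuplexStep(nn.Module):
    """One duplex modeling step: fuse the user encoder features with the
    agent's previous codec tokens, run the LLM backbone over the fused
    stream, and predict the next agent codec token at each frame."""

    def __init__(self, d_model=1024, codebook_size=4096):
        super().__init__()
        self.agent_embed = nn.Embedding(codebook_size, d_model)
        # Channel fusion (assumed form): concatenate the two channels at
        # each time step and project back to the LLM width.
        self.fuse = nn.Linear(2 * d_model, d_model)
        self.backbone = nn.TransformerEncoderLayer(  # stand-in for the LLM
            d_model, nhead=8, batch_first=True)
        self.agent_head = nn.Linear(d_model, codebook_size)

    def forward(self, user_feats, agent_tokens):
        # user_feats: (B, T, d_model) from the streaming user encoder
        # agent_tokens: (B, T) previously generated agent codec tokens
        fused = self.fuse(torch.cat(
            [user_feats, self.agent_embed(agent_tokens)], dim=-1))
        hidden = self.backbone(fused)   # a real system would apply a causal mask
        return self.agent_head(hidden)  # logits over the agent codebook
```

The key property this illustrates is that the user and agent channels share one time axis, so the backbone can decide at every frame whether to keep speaking or yield to a barge-in.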
System Components
- Streaming User Encoder: A causal, real-time speech encoder that processes continuous user audio and produces representations directly usable by the LLM, bypassing the need for speech pretraining (see the causal-convolution sketch after this list)
- Channel Fusion: Merges the continuous user audio stream with the agent codec output stream so the model can jointly reason about both channels simultaneously, enabling true duplex behavior
- Agent Codec Decoder: A dedicated codec-based speech synthesis module fine-tuned independently for the agent, improving voice quality and reducing the bitrate to 0.6 kbps
- LLM Backbone: A pretrained large language model that serves as the reasoning core, adapted for duplex S2S without requiring speech-domain pretraining
- Barge-In Handling: An emergent capability of the duplex architecture that allows the model to detect and respond to user interruptions in real time
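The "causal, real-time" property of the user encoder can be illustrated with left-padded convolutions, as sketched below; this is a generic streaming-encoder building block under assumed dimensions, not the specific pretrained encoder the paper uses.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Conv1d):
    """Conv1d that pads only on the left, so each output frame depends
    solely on current and past input; this lets the encoder run
    incrementally on live audio with no lookahead latency."""

    def forward(self, x):
        left_pad = (self.kernel_size[0] - 1) * self.dilation[0]
        return super().forward(nn.functional.pad(x, (left_pad, 0)))

# Usage (assumed feature sizes): 80-dim mel frames in, 512-dim streaming
# features out, with output length equal to input length at every step.
enc = CausalConv1d(80, 512, kernel_size=5)
feats = enc(torch.randn(1, 80, 100))   # -> torch.Size([1, 512, 100])
```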
Results
| Metric/Benchmark | Previous Duplex Models | This Paper | Delta |
|---|---|---|---|
| Reasoning Ability | Lower | Higher (best among duplex models) | Improved |
| Turn-Taking Accuracy | Lower | Higher (best among duplex models) | Improved |
| Barge-In Handling | Lower | Higher (best among duplex models) | Improved |
| Agent Codec Bitrate | ~1.2 kbps | 0.6 kbps | ~50% reduction |
| Speech Pretraining Required | Yes | No | Eliminated entirely |
Key Takeaways
- ML practitioners can now build duplex conversational speech models starting from any pretrained LLM without collecting large speech pretraining datasets, dramatically reducing development cost and complexity.
- The 0.6 kbps codec bitrate and streaming encoder design make this architecture suitable for low-bandwidth, real-time deployment scenarios such as mobile or edge devices.
- The fully open-sourced training and inference codebase provides a reproducible baseline for future duplex S2S research, lowering the barrier for the broader NLP/speech community to iterate on real-time spoken dialogue systems.
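For a sense of how such a model is driven at inference time, the loop below sketches a real-time duplex session; every name here (`mic`, `encoder.encode_step`, `model.step`, `codec`) is a placeholder for illustration, not the API of the released codebase.

```python
# Hypothetical real-time duplex loop (placeholder API, not the released
# code): each audio hop, encode the newest user chunk, advance the LLM
# one step over both channels, and play back any non-silent agent token.
def duplex_loop(mic, speaker, encoder, model, codec, hop_ms=80):
    state = model.initial_state()
    while True:
        chunk = mic.read(hop_ms)                 # newest user audio
        user_feat = encoder.encode_step(chunk)   # causal: no lookahead
        agent_token, state = model.step(user_feat, state)
        # Turn-taking and barge-in fall out of the token stream: the
        # model emits silence on the agent channel when the user talks.
        if agent_token != codec.silence_token:
            speaker.play(codec.decode_step(agent_token))
```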
Abstract
Spoken dialogue is an intuitive form of human-computer interaction, yet current speech language models often remain constrained to turn-based exchanges, lacking real-time adaptability such as user barge-in. We propose a novel duplex speech-to-speech (S2S) architecture featuring continuous user inputs and codec agent outputs with channel fusion that directly models simultaneous user and agent streams. Using a pretrained streaming encoder for user input enables the first duplex S2S model without requiring speech pretraining. Separate architectures for agent and user modeling facilitate codec fine-tuning for better agent voices and halve the bitrate (0.6 kbps) compared to previous work. Experimental results show that the proposed model outperforms previous duplex models in reasoning, turn-taking, and barge-in abilities. The model requires significantly less speech data, as speech pretraining is skipped, which markedly simplifies the process of building a duplex S2S model from any LLM. Finally, it is the first openly available duplex S2S model with training and inference code to foster reproducibility.