SALMONN-omni: A Standalone Speech LLM without Codec Injection for Full-duplex Conversation
Problem Statement
Existing full-duplex conversational AI systems rely on modular pipelines with auxiliary components such as voice activity detectors (VADs), interrupters, conversation state predictors, and multiple LLMs, causing error accumulation and poor handling of context-dependent barge-in and echo cancellation. Codec-injection approaches such as Moshi simplify the pipeline but suffer significant performance degradation when operating in the speech modality rather than text. A unified, codec-free architecture that maintains high speech understanding quality while supporting real-time duplex interaction has been missing.
Key Novelty
- First standalone full-duplex speech LLM operating without audio codecs in the token space, preserving speech understanding quality
- Novel dynamic thinking mechanism within the LLM backbone that learns when to transition between speaking and listening states in a data-driven manner
- Reinforcement learning integration to further improve complex conversational behaviors including turn-taking, backchanneling, echo cancellation, and context-dependent barge-in
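The dynamic thinking mechanism can be pictured as the backbone emitting special control tokens that drive a listening/speaking state machine, rather than an external VAD making the call. The token names, decision rule, and simulated stream below are illustrative assumptions, not details from the paper:

```python
# Hypothetical control tokens the backbone might emit to switch states;
# SALMONN-omni's actual token inventory may differ.
SPEAK, LISTEN = "<speak>", "<listen>"

def update_state(state: str, token: str) -> str:
    """Advance the conversational state from one decoded token.

    Unlike an external VAD, the decision is made by the LLM itself:
    a control token flips the state, any other token keeps it.
    """
    if token == SPEAK:
        return "speaking"
    if token == LISTEN:
        return "listening"
    return state  # ordinary speech/text token: stay in the current state

# Simulated decoded stream: the model listens, decides to answer,
# then yields the floor back to the user (data-driven turn-taking).
stream = [LISTEN, "user:", "hi", SPEAK, "hello!", LISTEN]
state = "listening"
trace = []
for tok in stream:
    state = update_state(state, tok)
    trace.append(state)
print(trace[-1])  # "listening"
```

Because the transition is just another token prediction, it is trained end-to-end with the rest of the model rather than tuned as a separate module.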
Evaluation Highlights
- At least 30% relative performance improvement over existing open-source full-duplex models on spoken QA and open-domain dialogue benchmarks
- Performs highly competitively with half-duplex and turn-based systems despite using substantially less training data, demonstrating strong data efficiency
Methodology
- Build on a speech LLM backbone (SALMONN heritage) that encodes audio directly without discretizing into codec tokens, preserving continuous speech representations for higher-quality understanding
- Introduce a dynamic thinking mechanism that allows the LLM to internally reason about conversational state, deciding when to switch between listening and speaking modes without external state predictors or VADs
- Apply reinforcement learning fine-tuning to optimize complex conversational behaviors such as context-dependent barge-in, echo cancellation, turn-taking, and backchanneling beyond what supervised training achieves
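One way to picture the RL stage is a scalar reward that scores a conversation episode on the behaviors listed above. The field names and weights here are invented for illustration; the paper's actual reward design is not specified in this summary:

```python
def conversation_reward(episode: dict) -> float:
    """Toy reward for one full-duplex episode.

    Each term scores a behavior the RL stage targets; in practice the
    reward would be learned or far more carefully shaped than this.
    """
    r = 0.0
    r += 1.0 if episode.get("handled_barge_in") else -1.0    # context-dependent barge-in
    r += 1.0 if episode.get("ignored_own_echo") else -1.0    # echo cancellation
    r += 0.5 if episode.get("timely_turn_taking") else -0.5  # turn-taking latency
    r += 0.25 if episode.get("natural_backchannel") else 0.0 # backchanneling is a bonus
    return r

# A well-behaved episode scores higher than one that talks over the user.
good = {"handled_barge_in": True, "ignored_own_echo": True,
        "timely_turn_taking": True, "natural_backchannel": True}
bad = {"handled_barge_in": False, "ignored_own_echo": False,
       "timely_turn_taking": False, "natural_backchannel": False}
print(conversation_reward(good), conversation_reward(bad))  # 2.75 -2.5
```

Optimizing such a reward with a policy-gradient method can reinforce behaviors that are hard to specify as supervised targets, which is the rationale given for the post-training RL stage.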
System Components
- Codec-free audio encoder: encodes audio input as continuous features fed directly into the LLM without discretizing via audio codecs, preserving acoustic and semantic fidelity
- Dynamic thinking mechanism: an in-context reasoning module within the LLM backbone that learns to predict speaking/listening state transitions, replacing external VADs or conversation state predictors
- Unified duplex LLM: a single model that simultaneously handles speech input and output streams, enabling real-time bidirectional conversation without multiple model components
- RL post-training stage: further improves nuanced conversational behaviors, including barge-in, echo cancellation, and backchanneling, through reward-based optimization
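The contrast between codec injection and the codec-free path can be sketched as follows. The dimensions, codebook, and projection weights are toy values chosen for illustration, not SALMONN-omni's actual components:

```python
import math

def quantize(frame, codebook):
    """Codec-style path (what SALMONN-omni avoids): replace a continuous
    feature frame with the index of its nearest codebook entry, which
    discards fine acoustic detail."""
    return min(range(len(codebook)), key=lambda i: math.dist(frame, codebook[i]))

def project(frame, weights):
    """Codec-free path: linearly map the continuous frame into the LLM's
    embedding space, keeping the full representation."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]

frame = [0.9, 0.2]                   # one continuous audio feature frame
codebook = [[0.0, 0.0], [1.0, 0.0]]  # tiny 2-entry codec codebook
weights = [[2.0, 0.0], [0.0, 2.0]]   # toy 2x2 projection matrix

print(quantize(frame, codebook))     # 1 -- only a discrete index survives
print(project(frame, weights))       # [1.8, 0.4] -- continuous detail preserved
```

The quantized index collapses every frame near a codebook entry onto the same token, whereas the projection keeps the frame's exact values, which is the intuition behind the summary's claim of higher-quality speech understanding.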
Results
| Benchmark/Scenario | Comparison Baseline | SALMONN-omni | Delta |
|---|---|---|---|
| Spoken QA Benchmarks | Best open-source full-duplex SOTA (e.g., Moshi) | ≥30% relative improvement | +30% relative |
| Open-domain Dialogue | Best open-source full-duplex SOTA (e.g., Moshi) | ≥30% relative improvement | +30% relative |
| vs. Half-duplex/Turn-based Systems | Half-duplex and turn-based SOTA | Highly competitive | Near parity with substantially less training data |
| Complex Conversational Scenarios (barge-in, echo cancellation, backchanneling) | Weak in modular systems | Strong performance, further improved by RL | Qualitative improvement |
Key Takeaways
- Codec injection is not necessary for full-duplex speech LLMs — continuous speech representations can be used directly, significantly improving speech understanding quality while maintaining real-time duplex capability
- A dynamic thinking mechanism trained end-to-end can replace brittle modular components (VADs, state predictors) for conversational state management, reducing error accumulation in speech pipelines
- Reinforcement learning is a promising post-training strategy for improving subtle conversational behaviors (barge-in, echo cancellation) that are hard to capture with supervised learning alone, offering a practical path to more natural AI speech agents
Abstract
In order to enable fluid and natural human-machine speech interaction, existing full-duplex conversational systems often adopt modular architectures with auxiliary components such as voice activity detectors, interrupters, conversation state predictors, or multiple LLMs. These systems, however, suffer from error accumulation across modules and struggle with key challenges such as context-dependent barge-in and echo cancellation. Recent approaches, most notably Moshi, simplify the pipeline by injecting audio codecs into the token space of a single LLM. However, such methods still incur significant performance degradation when operating on the speech rather than the text modality. In this paper, we introduce SALMONN-omni, the first single, standalone full-duplex speech LLM that operates without audio codecs in its token space. It features a novel dynamic thinking mechanism within the LLM backbone, enabling the model to learn when to transition between speaking and listening states. Experiments on widely used benchmarks for spoken question answering and open-domain dialogue show that SALMONN-omni achieves at least a 30% relative performance improvement over existing open-source full-duplex models and performs highly competitively with half-duplex and turn-based systems, despite using substantially less training data. Moreover, SALMONN-omni demonstrates strong performance in complex conversational scenarios, including turn-taking, backchanneling, echo cancellation, and context-dependent barge-in, with further improvements achieved through reinforcement learning. Demo conversations between a user and SALMONN-omni are provided in the repository https://github.com/bytedance/SALMONN.