
SALMONN-omni: A Standalone Speech LLM without Codec Injection for Full-duplex Conversation

Wenyi Yu, Siyin Wang, Xiaoyu Yang, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Guangzhi Sun, Lu Lu, Yuxuan Wang, Chao Zhang
arXiv.org | 2025
SALMONN-omni is the first standalone full-duplex speech LLM that achieves natural human-machine conversation without injecting audio codecs into its token space, using a novel dynamic thinking mechanism to manage speaking/listening state transitions.

Problem Statement

Existing full-duplex conversational AI systems rely on modular pipelines with components like VADs, interrupters, and multiple LLMs, causing error accumulation and poor handling of context-dependent barge-in and echo cancellation. Codec-injection approaches like Moshi simplify the pipeline but suffer significant performance degradation when operating in the speech modality compared to text. A unified, codec-free architecture that maintains high speech understanding quality while supporting real-time duplex interaction has been missing.

Key Novelty

  • First standalone full-duplex speech LLM operating without audio codecs in the token space, preserving speech understanding quality
  • Novel dynamic thinking mechanism within the LLM backbone that learns when to transition between speaking and listening states in a data-driven manner
  • Reinforcement learning integration to further improve complex conversational behaviors including turn-taking, backchanneling, echo cancellation, and context-dependent barge-in

Evaluation Highlights

  • At least 30% relative performance improvement over existing open-source full-duplex models on spoken QA and open-domain dialogue benchmarks
  • Performs highly competitively with half-duplex and turn-based systems despite using substantially less training data, demonstrating strong data efficiency

Breakthrough Assessment

7/10 — SALMONN-omni represents a significant architectural advance: it eliminates codec injection while achieving full-duplex capability and approaching turn-based system quality, a combination previously considered a hard trade-off. However, full-duplex speech LLMs are an active, fast-moving area, and the paper builds on existing foundations such as SALMONN.

Methodology

  1. Build on a speech LLM backbone (SALMONN heritage) that encodes audio directly without discretizing into codec tokens, preserving continuous speech representations for higher-quality understanding
  2. Introduce a dynamic thinking mechanism that allows the LLM to internally reason about conversational state, deciding when to switch between listening and speaking modes without external state predictors or VADs
  3. Apply reinforcement learning fine-tuning to optimize complex conversational behaviors such as context-dependent barge-in, echo cancellation, turn-taking, and backchanneling beyond what supervised training achieves
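The dynamic thinking step can be pictured as the backbone emitting control tokens at each audio block that decide whether to keep or switch the conversational state. The sketch below is a hypothetical illustration: the special token names, the energy-based `predict_state_token` stub, and the 0.1 threshold are assumptions standing in for the LLM's learned, data-driven decision, not the paper's actual mechanism.

```python
# Hypothetical sketch of a dynamic "thinking" loop for full-duplex state
# control. Token names, threshold, and the rule-based stub are illustrative
# assumptions; in SALMONN-omni this decision is learned by the LLM backbone.

LISTEN, SPEAK = "listening", "speaking"

def predict_state_token(state, audio_block):
    """Stand-in for the LLM backbone: return a control token deciding
    whether to keep the current state or transition. Toy rule: silence
    ends a user turn; user speech during output triggers barge-in."""
    user_active = max(audio_block, default=0.0) > 0.1  # toy energy check
    if state == LISTEN and not user_active:
        return "<shift-to-speak>"   # user stopped speaking: take the turn
    if state == SPEAK and user_active:
        return "<shift-to-listen>"  # user barged in: yield the turn
    return "<keep-state>"

def run_duplex(audio_blocks):
    """Consume streaming audio blocks and trace the state after each one."""
    state, trace = LISTEN, []
    for block in audio_blocks:
        token = predict_state_token(state, block)
        if token == "<shift-to-speak>":
            state = SPEAK
        elif token == "<shift-to-listen>":
            state = LISTEN
        trace.append(state)
    return trace
```

For example, `run_duplex([[0.5], [0.0], [0.0], [0.6]])` traces a user turn, a model turn, and a context-triggered yield back to listening — the same transition pattern the learned mechanism must produce, but from conversational context rather than a fixed energy rule.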

System Components

Codec-free Speech Encoder

Encodes audio input as continuous features directly into the LLM without discretizing via audio codecs, preserving acoustic and semantic fidelity
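The contrast with codec injection can be made concrete with a toy front end. In the sketch below, the frame size, the two-value feature, and the quantization rule are all illustrative assumptions; real systems use learned speech encoders and neural codecs. The point is only the information loss: the codec path collapses each frame to one discrete token, while the codec-free path keeps real-valued features.

```python
# Illustrative contrast between codec-token injection and a codec-free,
# continuous-feature front end. Frame size, features, and quantization
# are toy assumptions, not any real codec or encoder.

def frame(wave, size=4):
    """Split a waveform into fixed-size frames (trailing partial frame dropped)."""
    return [wave[i:i + size] for i in range(0, len(wave) - size + 1, size)]

def codec_tokens(wave, levels=4):
    """Codec-style path: each frame collapses to a single discrete token,
    discarding fine acoustic detail."""
    return [min(levels - 1, int(sum(abs(x) for x in f) / len(f) * levels))
            for f in frame(wave)]

def continuous_features(wave):
    """Codec-free path: each frame keeps a real-valued feature vector
    (here mean and peak amplitude) that can be projected directly into
    the LLM's embedding space."""
    return [(sum(f) / len(f), max(f, key=abs)) for f in frame(wave)]
```

Quantizing to a handful of levels maps many distinct frames to the same token; the continuous path preserves those distinctions, which is the fidelity argument for codec-free encoding.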

Dynamic Thinking Mechanism

An in-context reasoning module within the LLM backbone that learns to predict speaking/listening state transitions, replacing external VADs or conversation state predictors

Full-duplex LLM Backbone

Single unified LLM that simultaneously handles speech input and output streams, enabling real-time bidirectional conversation without multiple model components

Reinforcement Learning Module

Post-training RL stage that further improves nuanced conversational behaviors including barge-in, echo cancellation, and backchanneling through reward-based optimization
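A minimal way to see how reward-based optimization can shape a single conversational behavior is a one-parameter REINFORCE bandit: the policy's logit controls the probability of yielding the turn when the user barges in, and rewards push it toward always yielding. The reward scheme, policy parameterization, and learning rate below are assumptions for illustration only; the paper's RL setup is substantially more involved.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_yield_policy(steps=2000, lr=0.5, seed=0):
    """Toy REINFORCE sketch: learn P(yield | user barge-in) with a
    single-logit policy. Reward is +1 for yielding during a barge-in,
    -1 for talking over the user. Illustrative assumption only."""
    rng = random.Random(seed)
    theta = 0.0                                 # logit of the yield probability
    for _ in range(steps):
        p = sigmoid(theta)
        action = 1 if rng.random() < p else 0   # 1 = yield the turn
        reward = 1.0 if action == 1 else -1.0
        # REINFORCE update: grad of log pi(action) w.r.t. theta is (action - p)
        theta += lr * reward * (action - p)
    return sigmoid(theta)
```

After training, the yield probability approaches 1: supervised data may rarely show the "wrong" behavior, but a reward signal penalizing it directly is enough to drive the policy there, which is the intuition behind using RL for behaviors like barge-in handling.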

Results

| Benchmark / Scenario | Best Open-Source Full-Duplex (e.g., Moshi) | SALMONN-omni | Delta |
| --- | --- | --- | --- |
| Spoken QA benchmarks | Baseline full-duplex SOTA | ≥30% relative improvement | +30% relative |
| Open-domain dialogue | Baseline full-duplex SOTA | ≥30% relative improvement | +30% relative |
| vs. half-duplex / turn-based systems | Full performance with text pipeline | Highly competitive | Near parity with less data |
| Complex scenarios (barge-in, echo cancellation, backchanneling) | Weak in modular systems | Strong performance + RL gains | Qualitative improvement |

Key Takeaways

  • Codec injection is not necessary for full-duplex speech LLMs — continuous speech representations can be used directly, significantly improving speech understanding quality while maintaining real-time duplex capability
  • A dynamic thinking mechanism trained end-to-end can replace brittle modular components (VADs, state predictors) for conversational state management, reducing error accumulation in speech pipelines
  • Reinforcement learning is a promising post-training strategy for improving subtle conversational behaviors (barge-in, echo cancellation) that are hard to capture with supervised learning alone, offering a practical path to more natural AI speech agents

Abstract

In order to enable fluid and natural human-machine speech interaction, existing full-duplex conversational systems often adopt modular architectures with auxiliary components such as voice activity detectors, interrupters, conversation state predictors, or multiple LLMs. These systems, however, suffer from error accumulation across modules and struggle with key challenges such as context-dependent barge-in and echo cancellation. Recent approaches, most notably Moshi, simplify the pipeline by injecting audio codecs into the token space of a single LLM. However, such methods still incur significant performance degradation when operating on the speech rather than text modality. In this paper, we introduce SALMONN-omni, the first single, standalone full-duplex speech LLM that operates without audio codecs in its token space. It features a novel dynamic thinking mechanism within the LLM backbone, enabling the model to learn when to transition between speaking and listening states. Experiments on widely used benchmarks for spoken question answering and open-domain dialogue show that SALMONN-omni achieves at least 30% relative performance improvement over existing open-source full-duplex models and performs highly competitively to half-duplex and turn-based systems, despite using substantially less training data. Moreover, SALMONN-omni demonstrates strong performance in complex conversational scenarios, including turn-taking, backchanneling, echo cancellation and context-dependent barge-in, with further improvements achieved through reinforcement learning. Some demo conversations between user and SALMONN-omni are provided in the following repository https://github.com/bytedance/SALMONN.

Generated on 2026-04-01 using Claude