SALMONN-omni: A Standalone Speech LLM without Codec Injection for Full-duplex Conversation
Problem Statement
Existing full-duplex conversational AI systems rely on modular pipelines with auxiliary components such as voice activity detectors (VADs), interrupters, conversation state predictors, and multiple LLMs, causing error accumulation and poor handling of context-dependent barge-in and echo cancellation. Codec-injection approaches such as Moshi simplify the pipeline but suffer significant performance degradation when operating in the speech modality rather than text. A unified, codec-free architecture that maintains high speech understanding quality while supporting real-time duplex interaction has been missing.
Key Novelty
- First standalone full-duplex speech LLM operating without audio codecs in the token space, preserving speech understanding quality
- Novel dynamic thinking mechanism within the LLM backbone that learns when to transition between speaking and listening states in a data-driven manner
- Reinforcement learning integration to further improve complex conversational behaviors including turn-taking, backchanneling, echo cancellation, and context-dependent barge-in
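The dynamic thinking mechanism can be pictured as the backbone emitting special control tokens that drive a listening/speaking state machine, rather than an external VAD making the call. The token names, decision rule, and simulated stream below are illustrative assumptions, not details from the paper:

```python
# Hypothetical control tokens the backbone might emit to switch states;
# SALMONN-omni's actual token inventory may differ.
SPEAK, LISTEN = "<speak>", "<listen>"

def update_state(state: str, token: str) -> str:
    """Advance the conversational state from one decoded token.

    Unlike an external VAD, the decision is made by the LLM itself:
    a control token flips the state, any other token keeps it.
    """
    if token == SPEAK:
        return "speaking"
    if token == LISTEN:
        return "listening"
    return state  # ordinary speech/text token: stay in the current state

# Simulated decoded stream: the model listens, decides to answer,
# then yields the floor back to the user (data-driven turn-taking).
stream = [LISTEN, "user:", "hi", SPEAK, "hello!", LISTEN]
state = "listening"
trace = []
for tok in stream:
    state = update_state(state, tok)
    trace.append(state)
print(trace[-1])  # "listening"
```

Because the transition is just another token prediction, it is trained end-to-end with the rest of the model rather than tuned as a separate module.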
Evaluation Highlights
- At least 30% relative performance improvement over existing open-source full-duplex models on spoken QA and open-domain dialogue benchmarks
- Performs highly competitively with half-duplex and turn-based systems despite using substantially less training data, demonstrating strong data efficiency
Methodology
- Build on a speech LLM backbone (SALMONN heritage) that encodes audio directly without discretizing into codec tokens, preserving continuous speech representations for higher-quality understanding
- Introduce a dynamic thinking mechanism that allows the LLM to internally reason about conversational state, deciding when to switch between listening and speaking modes without external state predictors or VADs
- Apply reinforcement learning fine-tuning to optimize complex conversational behaviors such as context-dependent barge-in, echo cancellation, turn-taking, and backchanneling beyond what supervised training achieves
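One way to picture the RL stage is a scalar reward that scores a conversation episode on the behaviors listed above. The field names and weights here are invented for illustration; the paper's actual reward design is not specified in this summary:

```python
def conversation_reward(episode: dict) -> float:
    """Toy reward for one full-duplex episode.

    Each term scores a behavior the RL stage targets; in practice the
    reward would be learned or far more carefully shaped than this.
    """
    r = 0.0
    r += 1.0 if episode.get("handled_barge_in") else -1.0    # context-dependent barge-in
    r += 1.0 if episode.get("ignored_own_echo") else -1.0    # echo cancellation
    r += 0.5 if episode.get("timely_turn_taking") else -0.5  # turn-taking latency
    r += 0.25 if episode.get("natural_backchannel") else 0.0 # backchanneling is a bonus
    return r

# A well-behaved episode scores higher than one that talks over the user.
good = {"handled_barge_in": True, "ignored_own_echo": True,
        "timely_turn_taking": True, "natural_backchannel": True}
bad = {"handled_barge_in": False, "ignored_own_echo": False,
       "timely_turn_taking": False, "natural_backchannel": False}
print(conversation_reward(good), conversation_reward(bad))  # 2.75 -2.5
```

Optimizing such a reward with a policy-gradient method can reinforce behaviors that are hard to specify as supervised targets, which is the rationale given for the post-training RL stage.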
System Components
- Codec-free audio encoder: encodes audio input as continuous features fed directly into the LLM without discretizing via audio codecs, preserving acoustic and semantic fidelity
- Dynamic thinking mechanism: an in-context reasoning module within the LLM backbone that learns to predict speaking/listening state transitions, replacing external VADs or conversation state predictors
- Unified duplex LLM: a single model that simultaneously handles speech input and output streams, enabling real-time bidirectional conversation without multiple model components
- RL post-training stage: further improves nuanced conversational behaviors, including barge-in, echo cancellation, and backchanneling, through reward-based optimization
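The contrast between codec injection and the codec-free path can be sketched as follows. The dimensions, codebook, and projection weights are toy values chosen for illustration, not SALMONN-omni's actual components:

```python
import math

def quantize(frame, codebook):
    """Codec-style path (what SALMONN-omni avoids): replace a continuous
    feature frame with the index of its nearest codebook entry, which
    discards fine acoustic detail."""
    return min(range(len(codebook)), key=lambda i: math.dist(frame, codebook[i]))

def project(frame, weights):
    """Codec-free path: linearly map the continuous frame into the LLM's
    embedding space, keeping the full representation."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]

frame = [0.9, 0.2]                   # one continuous audio feature frame
codebook = [[0.0, 0.0], [1.0, 0.0]]  # tiny 2-entry codec codebook
weights = [[2.0, 0.0], [0.0, 2.0]]   # toy 2x2 projection matrix

print(quantize(frame, codebook))     # 1 -- only a discrete index survives
print(project(frame, weights))       # [1.8, 0.4] -- continuous detail preserved
```

The quantized index collapses every frame near a codebook entry onto the same token, whereas the projection keeps the frame's exact values, which is the intuition behind the summary's claim of higher-quality speech understanding.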
Results
| Benchmark/Scenario | Comparison Baseline | SALMONN-omni | Delta |
|---|---|---|---|
| Spoken QA Benchmarks | Best open-source full-duplex SOTA (e.g., Moshi) | ≥30% relative improvement | +30% relative |
| Open-domain Dialogue | Best open-source full-duplex SOTA (e.g., Moshi) | ≥30% relative improvement | +30% relative |
| vs. Half-duplex/Turn-based Systems | Half-duplex and turn-based SOTA | Highly competitive | Near parity with substantially less training data |
| Complex Conversational Scenarios (barge-in, echo cancellation, backchanneling) | Weak in modular systems | Strong performance, further improved by RL | Qualitative improvement |
Key Takeaways
- Codec injection is not necessary for full-duplex speech LLMs — continuous speech representations can be used directly, significantly improving speech understanding quality while maintaining real-time duplex capability
- A dynamic thinking mechanism trained end-to-end can replace brittle modular components (VADs, state predictors) for conversational state management, reducing error accumulation in speech pipelines
- Reinforcement learning is a promising post-training strategy for improving subtle conversational behaviors (barge-in, echo cancellation) that are hard to capture with supervised learning alone, offering a practical path to more natural AI speech agents
Abstract
In order to enable fluid and natural human-machine speech interaction, existing full-duplex conversational systems often adopt modular architectures with auxiliary components such as voice activity detectors, interrupters, conversation state predictors, or multiple LLMs. These systems, however, suffer from error accumulation across modules and struggle with key challenges such as context-dependent barge-in and echo cancellation. Recent approaches, most notably Moshi, simplify the pipeline by injecting audio codecs into the token space of a single LLM. However, such methods still incur significant performance degradation when operating on the speech rather than the text modality. In this paper, we introduce SALMONN-omni, the first single, standalone full-duplex speech LLM that operates without audio codecs in its token space. It features a novel dynamic thinking mechanism within the LLM backbone, enabling the model to learn when to transition between speaking and listening states. Experiments on widely used benchmarks for spoken question answering and open-domain dialogue show that SALMONN-omni achieves at least a 30% relative performance improvement over existing open-source full-duplex models and performs highly competitively with half-duplex and turn-based systems, despite using substantially less training data. Moreover, SALMONN-omni demonstrates strong performance in complex conversational scenarios, including turn-taking, backchanneling, echo cancellation, and context-dependent barge-in, with further improvements achieved through reinforcement learning. Demo conversations between a user and SALMONN-omni are provided in the repository https://github.com/bytedance/SALMONN.