
MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts

Yuxuan Lou, Kai Yang, Yang You
arXiv.org | 2026
MoST introduces a Modality-Aware Mixture of Experts (MAMoE) architecture that routes speech and text tokens to specialized expert pathways, enabling a single multimodal LLM to achieve strong performance on both speech and text tasks simultaneously.

Problem Statement

Current multimodal LLMs process different modality representations (speech vs. text) with identical shared parameters, ignoring the fundamentally different statistical and structural properties of each modality. This one-size-fits-all approach limits modality-specific learning and cross-modal transfer. Additionally, most competitive speech-text LLMs rely on proprietary or closed datasets, limiting reproducibility and community adoption.

Key Novelty

  • Modality-Aware Mixture of Experts (MAMoE): a routing mechanism that directs tokens to modality-specific expert groups or shared experts based on input type, enabling specialized and joint learning simultaneously (a minimal routing sketch follows this list)
  • First fully open-source speech-text LLM built on a Mixture of Experts backbone, with all models, training code, inference code, and training data publicly released
  • An efficient post-training pipeline that adapts a pretrained MoE LLM using only open-source ASR/TTS datasets followed by speech-text instruction fine-tuning, achieving data efficiency without proprietary resources
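
The core routing idea can be shown in a few lines. The sketch below assumes the experts are laid out as [speech | text | shared] and masks the router logits so each token competes only over its own modality group plus the shared experts; the function name, expert layout, and top-k renormalization are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of modality-aware top-k routing (assumed expert layout:
# [speech experts | text experts | shared experts]; names are illustrative).
import torch

def modality_aware_topk(router_logits, modality_ids, n_speech, n_text, n_shared, k=2):
    """Select top-k experts per token, restricted to that token's modality group
    plus the shared experts.

    router_logits: (num_tokens, num_experts)
    modality_ids:  (num_tokens,) with 0 = speech token, 1 = text token
    """
    num_tokens, num_experts = router_logits.shape
    assert num_experts == n_speech + n_text + n_shared

    # Per-token mask of experts the token is allowed to use.
    allowed = torch.zeros(num_tokens, num_experts, dtype=torch.bool,
                          device=router_logits.device)
    allowed[:, n_speech + n_text:] = True                          # shared experts: everyone
    allowed[modality_ids == 0, :n_speech] = True                   # speech experts
    allowed[modality_ids == 1, n_speech:n_speech + n_text] = True  # text experts

    masked = router_logits.masked_fill(~allowed, float("-inf"))
    weights, indices = masked.topk(k, dim=-1)
    return torch.softmax(weights, dim=-1), indices                 # renormalize over selected experts

# Example: 5 tokens, 8 experts (3 speech, 3 text, 2 shared); first two tokens are speech.
w, idx = modality_aware_topk(torch.randn(5, 8), torch.tensor([0, 0, 1, 1, 1]), 3, 3, 2)
```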

Evaluation Highlights

  • MoST consistently outperforms existing models of comparable parameter counts across ASR, TTS, audio language modeling, and spoken question answering benchmarks
  • Ablation studies confirm that both modality-specific routing and shared experts independently contribute meaningful performance gains across all evaluated domains

Breakthrough Assessment

7/10. MoST makes a significant architectural contribution by demonstrating that modality-aware routing in MoE models is a principled and effective solution to the heterogeneous representation problem in multimodal LLMs, while also advancing open-source accessibility in the speech-text LLM space; however, it is an evolutionary rather than paradigm-shifting advance.

Methodology

  1. Start from a pretrained MoE language model and introduce MAMoE by partitioning experts into modality-specific groups (speech experts, text experts) and shared experts, with a modality-aware router that assigns tokens based on their input modality type
  2. Perform strategic post-training on open-source ASR and TTS datasets to teach the model speech-text alignment and generation capabilities while preserving the pretrained language model's knowledge
  3. Fine-tune the resulting model on a carefully curated open-source speech-text instruction dataset to enable instruction-following, spoken question answering, and audio language modeling capabilities (a skeleton of this two-stage pipeline is sketched below)
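
A minimal skeleton of the two-stage pipeline, assuming a Hugging-Face-style model whose forward pass returns a `.loss`; the dataset objects, optimizer, and hyperparameters are placeholders rather than the authors' recipe.

```python
# Sketch of the two-stage adaptation schedule (placeholder data and hyperparameters).
import torch
from torch.utils.data import DataLoader, ConcatDataset

def run_stage(model, dataset, lr=1e-5, epochs=1, batch_size=8):
    """Generic next-token training loop shared by both stages."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            loss = model(**batch).loss   # assumes the model returns a causal-LM loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

# Stage 1: speech-text alignment on open ASR/TTS pairs.
#   run_stage(mamoe_model, ConcatDataset([asr_dataset, tts_dataset]))
# Stage 2: speech-text instruction fine-tuning.
#   run_stage(mamoe_model, instruction_dataset, epochs=2)
```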

System Components

Modality-Aware Router

Routes incoming tokens to appropriate expert pathways based on whether the token originates from speech or text input, enabling modality-conditioned computation
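
The summary does not say how the router identifies a token's modality. One common setup is to append discrete speech units to the text vocabulary, in which case the label can be read off the token id; the vocabulary split point below is a hypothetical value for illustration.

```python
# Hypothetical modality labeling when discrete audio codes extend the text vocabulary.
import torch

TEXT_VOCAB_SIZE = 32_000   # assumed split point: ids below are text, ids at/above are speech units

def infer_modality_ids(input_ids: torch.Tensor) -> torch.Tensor:
    """Return 0 for speech tokens and 1 for text tokens, per position."""
    return (input_ids < TEXT_VOCAB_SIZE).long()

ids = torch.tensor([[15, 2048, 33100, 33205, 7]])
print(infer_modality_ids(ids))   # tensor([[1, 1, 0, 0, 1]])
```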

Modality-Specific Expert Groups

Dedicated sets of feed-forward experts that specialize in capturing domain-specific patterns unique to speech or text representations
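
Concretely, each group can be pictured as an independent bank of feed-forward experts. The sketch below uses a plain GELU MLP and illustrative dimensions; the paper's actual expert architecture and sizes are not reproduced here.

```python
# A minimal expert group: independent FFN experts sharing one shape (sizes are illustrative).
import torch.nn as nn

def make_expert_group(num_experts: int, d_model: int = 1024, d_ff: int = 4096) -> nn.ModuleList:
    return nn.ModuleList([
        nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        for _ in range(num_experts)
    ])

speech_experts = make_expert_group(4)   # group sizes are assumptions
text_experts   = make_expert_group(4)
shared_experts = make_expert_group(2)
```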

Shared Experts

A complementary set of experts accessible by both modalities that facilitate cross-modal information transfer and joint representation learning
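
Putting router, modality-specific groups, and shared experts together, one MAMoE layer could look like the sketch below, which reuses `modality_aware_topk` and `make_expert_group` from the earlier snippets. Routing shared experts through the same top-k gate and the naive per-token dispatch loop are simplifying assumptions, not the released implementation.

```python
# Compact forward pass for one modality-aware MoE layer (illustrative only).
import torch
import torch.nn as nn

class MAMoELayerSketch(nn.Module):
    """Assumed expert layout: [speech | text | shared]; shared experts are
    eligible for every token through the same top-k gate."""

    def __init__(self, d_model=1024, d_ff=4096, n_speech=4, n_text=4, n_shared=2, k=2):
        super().__init__()
        self.n_speech, self.n_text, self.n_shared, self.k = n_speech, n_text, n_shared, k
        self.router = nn.Linear(d_model, n_speech + n_text + n_shared, bias=False)
        self.experts = make_expert_group(n_speech + n_text + n_shared, d_model, d_ff)

    def forward(self, hidden, modality_ids):
        tokens = hidden.reshape(-1, hidden.size(-1))        # (num_tokens, d_model)
        weights, indices = modality_aware_topk(
            self.router(tokens), modality_ids.reshape(-1),
            self.n_speech, self.n_text, self.n_shared, self.k)
        out = torch.zeros_like(tokens)
        # Naive per-token loop for clarity; real MoE kernels batch tokens per expert.
        for t in range(tokens.size(0)):
            for slot in range(self.k):
                e = indices[t, slot].item()
                out[t] += weights[t, slot] * self.experts[e](tokens[t])
        return out.reshape_as(hidden)
```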

ASR/TTS Post-Training Stage

An intermediate training phase using open-source automatic speech recognition and text-to-speech datasets to establish speech-text grounding before instruction fine-tuning
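
One plausible way to realize this stage is to serialize ASR and TTS pairs into single token sequences for ordinary next-token training; the tags and templates below are hypothetical and not the paper's actual prompt format.

```python
# Hypothetical serialization of ASR/TTS pairs into single training strings.
def format_asr_example(speech_unit_ids, transcript):
    """Speech in, text out: teaches transcription."""
    units = " ".join(str(u) for u in speech_unit_ids)
    return f"<speech> {units} </speech> <transcript> {transcript}"

def format_tts_example(transcript, speech_unit_ids):
    """Text in, speech units out: teaches synthesis."""
    units = " ".join(str(u) for u in speech_unit_ids)
    return f"<text> {transcript} </text> <speech> {units} </speech>"

print(format_asr_example([33100, 33205, 33017], "hello world"))
# <speech> 33100 33205 33017 </speech> <transcript> hello world
```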

Speech-Text Instruction Dataset

A curated open-source collection used for fine-tuning to enable spoken QA, audio language modeling, and general instruction-following across modalities
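
For concreteness, one instruction record might look like the following; the field names and schema are placeholders and may differ from the released dataset.

```python
# Placeholder record layout for a speech-text instruction example.
import json

example = {
    "instruction": "Answer the spoken question.",
    "input_modality": "speech",
    "input_speech_units": [33100, 33205, 33017],   # discrete audio tokens (hypothetical)
    "output_modality": "text",
    "output_text": "The capital of France is Paris.",
}
print(json.dumps(example, indent=2))
```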

Results

Benchmark                                | Comparison                     | Result       | Delta
ASR (speech recognition)                 | Prior comparable-param models  | Outperforms  | Positive improvement
TTS (speech synthesis quality)           | Prior comparable-param models  | Outperforms  | Positive improvement
Audio Language Modeling                  | Prior comparable-param models  | Outperforms  | Positive improvement
Spoken Question Answering                | Prior comparable-param models  | Outperforms  | Positive improvement
Ablation: w/o modality-specific routing  | Full MAMoE model               | Degraded     | Negative delta
Ablation: w/o shared experts             | Full MAMoE model               | Degraded     | Negative delta

Key Takeaways

  • Modality-aware expert routing is a practical and effective design pattern for multimodal MoE LLMs — practitioners building speech-text or other multimodal systems should consider partitioning experts by modality rather than using fully shared parameters
  • A two-stage training pipeline (modality alignment post-training → instruction fine-tuning) on fully open-source data is sufficient to achieve competitive multimodal performance, making strong speech-text LLMs more accessible to researchers without proprietary data access
  • The fully open-source release (model weights, training/inference code, datasets) provides a strong reproducible baseline for the community to build upon for speech-text multimodal research and downstream applications

Abstract

We present MoST (Mixture of Speech and Text), a novel multimodal large language model that seamlessly integrates speech and text processing through our proposed Modality-Aware Mixture of Experts (MAMoE) architecture. While current multimodal models typically process diverse modality representations with identical parameters, disregarding their inherent representational differences, we introduce specialized routing pathways that direct tokens to modality-appropriate experts based on input type. MAMoE simultaneously enhances modality-specific learning and cross-modal understanding through two complementary components: modality-specific expert groups that capture domain-specific patterns and shared experts that facilitate information transfer between modalities. Building on this architecture, we develop an efficient transformation pipeline that adapts the pretrained MoE language model through strategic post-training on ASR and TTS datasets, followed by fine-tuning with a carefully curated speech-text instruction dataset. A key feature of this pipeline is that it relies exclusively on fully accessible, open-source datasets to achieve strong performance and data efficiency. Comprehensive evaluations across ASR, TTS, audio language modeling, and spoken question answering benchmarks show that MoST consistently outperforms existing models of comparable parameter counts. Our ablation studies confirm that the modality-specific routing mechanism and shared experts design significantly contribute to performance gains across all tested domains. To our knowledge, MoST represents the first fully open-source speech-text LLM built on a Mixture of Experts architecture. We release the MoST model, training code, inference code, and training data at https://github.com/NUS-HPC-AI-Lab/MoST.
