MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts
Problem Statement
Current multimodal LLMs process different modality representations (speech vs. text) with identical shared parameters, ignoring the fundamentally different statistical and structural properties of each modality. This one-size-fits-all approach limits modality-specific learning and cross-modal transfer. Additionally, most competitive speech-text LLMs rely on proprietary or closed datasets, limiting reproducibility and community adoption.
Key Novelty
- Modality-Aware Mixture of Experts (MAMoE): a routing mechanism that directs tokens to modality-specific expert groups or shared experts based on input type, enabling specialized and joint learning simultaneously (a minimal routing sketch follows this list)
- First fully open-source speech-text LLM built on a Mixture of Experts backbone, with all models, training code, inference code, and training data publicly released
- An efficient post-training pipeline that adapts a pretrained MoE LLM using only open-source ASR/TTS datasets followed by speech-text instruction fine-tuning, achieving data efficiency without proprietary resources
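To make the MAMoE routing idea concrete, here is a minimal PyTorch sketch of a modality-aware MoE layer. It is illustrative only, not the released MoST implementation: the class names (`Expert`, `ModalityAwareMoELayer`), the expert-group sizes, and the single-router-with-modality-mask formulation are assumptions; the summary above only specifies that tokens are routed to modality-specific or shared experts by input type.

```python
# Minimal sketch of modality-aware routing (illustrative; not the released MoST code).
# Assumption: each token carries a modality id (0 = text, 1 = speech), and experts are
# partitioned into text-only, speech-only, and shared groups.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A standard feed-forward expert block."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)


class ModalityAwareMoELayer(nn.Module):
    """Routes each token to top-k experts drawn from its own modality group plus the shared group."""
    def __init__(self, d_model=512, d_ff=2048, n_text=4, n_speech=4, n_shared=2, top_k=2):
        super().__init__()
        self.text_experts = nn.ModuleList([Expert(d_model, d_ff) for _ in range(n_text)])
        self.speech_experts = nn.ModuleList([Expert(d_model, d_ff) for _ in range(n_speech)])
        self.shared_experts = nn.ModuleList([Expert(d_model, d_ff) for _ in range(n_shared)])
        # A single router scores every expert; a modality mask hides the other modality's group.
        self.router = nn.Linear(d_model, n_text + n_speech + n_shared)
        self.n_text, self.n_speech, self.top_k = n_text, n_speech, top_k

    def forward(self, x, modality_ids):
        # x: (num_tokens, d_model); modality_ids: (num_tokens,) with 0 = text, 1 = speech
        logits = self.router(x)
        mask = torch.zeros_like(logits)
        # Text tokens cannot reach speech experts and vice versa; shared experts stay visible to both.
        mask[modality_ids == 0, self.n_text:self.n_text + self.n_speech] = float("-inf")
        mask[modality_ids == 1, :self.n_text] = float("-inf")
        weights = F.softmax(logits + mask, dim=-1)
        top_w, top_idx = weights.topk(self.top_k, dim=-1)
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)  # renormalize over the selected experts

        experts = list(self.text_experts) + list(self.speech_experts) + list(self.shared_experts)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for expert_id in top_idx[:, slot].unique():
                sel = top_idx[:, slot] == expert_id
                out[sel] += top_w[sel, slot].unsqueeze(-1) * experts[int(expert_id)](x[sel])
        return out
```

A production MoE layer would also add load-balancing losses and expert capacity limits; they are omitted here to keep the routing logic visible.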
Evaluation Highlights
- MoST consistently outperforms existing models of comparable parameter counts across ASR, TTS, audio language modeling, and spoken question answering benchmarks
- Ablation studies confirm that both modality-specific routing and shared experts independently contribute meaningful performance gains across all evaluated domains
Methodology
- Start from a pretrained MoE language model and introduce MAMoE by partitioning experts into modality-specific groups (speech experts, text experts) and shared experts, with a modality-aware router that assigns tokens based on their input modality type (see the initialization sketch after this list)
- Perform strategic post-training on open-source ASR and TTS datasets to teach the model speech-text alignment and generation capabilities while preserving the pretrained language model's knowledge
- Fine-tune the resulting model on a carefully curated open-source speech-text instruction dataset to enable instruction-following, spoken question answering, and audio language modeling capabilities
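The summary does not spell out how the pretrained experts are assigned to the new groups, so the sketch below shows one plausible initialization under the assumption that each group is seeded by cloning pretrained expert weights; the function name `partition_pretrained_experts` and the group sizes are hypothetical.

```python
# Hedged sketch: one plausible way to seed MAMoE expert groups from a pretrained MoE layer.
# Cloning pretrained experts into each group is an illustrative assumption, not a detail taken
# from the MoST release; the groups would then specialize during post-training.
import copy

import torch.nn as nn


def partition_pretrained_experts(pretrained_experts: nn.ModuleList,
                                 n_text: int, n_speech: int, n_shared: int):
    """Clone pretrained experts into text-specific, speech-specific, and shared groups."""
    assert len(pretrained_experts) >= max(n_text, n_speech, n_shared)
    text_experts = nn.ModuleList([copy.deepcopy(pretrained_experts[i]) for i in range(n_text)])
    speech_experts = nn.ModuleList([copy.deepcopy(pretrained_experts[i]) for i in range(n_speech)])
    shared_experts = nn.ModuleList([copy.deepcopy(pretrained_experts[i]) for i in range(n_shared)])
    return text_experts, speech_experts, shared_experts
```

Under such an initialization, the speech group would diverge from its text-derived starting point during the ASR/TTS post-training stage, since speech tokens are routed only to it and to the shared experts.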
System Components
- Modality-aware router: routes incoming tokens to the appropriate expert pathways based on whether a token originates from speech or text input, enabling modality-conditioned computation (the sketch after this list shows one way per-token modality ids could be derived)
- Modality-specific experts: dedicated sets of feed-forward experts that specialize in capturing domain-specific patterns unique to speech or text representations
- Shared experts: a complementary set of experts accessible to both modalities that facilitates cross-modal information transfer and joint representation learning
- ASR/TTS post-training stage: an intermediate training phase on open-source automatic speech recognition and text-to-speech datasets that establishes speech-text grounding before instruction fine-tuning
- Speech-text instruction dataset: a curated open-source collection used for fine-tuning to enable spoken QA, audio language modeling, and general instruction following across modalities
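The router needs a per-token modality signal. A common convention, assumed here rather than confirmed by the summary above, is to give discretized speech units an id range disjoint from the text vocabulary, so the modality id can be read directly off each token id:

```python
# Hedged sketch: deriving per-token modality ids for the modality-aware router, assuming text
# tokens and discretized speech units occupy disjoint id ranges. The vocabulary split below is
# an illustrative assumption, not a detail from the MoST release.
import torch

TEXT_VOCAB_SIZE = 32_000  # assumed: ids below this are text, ids at or above are speech units


def modality_ids_from_tokens(token_ids: torch.Tensor) -> torch.Tensor:
    """Return 0 for text tokens and 1 for speech tokens, per position."""
    return (token_ids >= TEXT_VOCAB_SIZE).long()


# Example: a mixed prompt with three text tokens followed by three speech units.
tokens = torch.tensor([12, 845, 31999, 32001, 32050, 32007])
print(modality_ids_from_tokens(tokens))  # tensor([0, 0, 0, 1, 1, 1])
```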
Results
| Setting | Comparison | Outcome | Delta |
|---|---|---|---|
| ASR (speech recognition) | Prior models of comparable parameter count | MoST outperforms | Positive |
| TTS (speech synthesis quality) | Prior models of comparable parameter count | MoST outperforms | Positive |
| Audio language modeling | Prior models of comparable parameter count | MoST outperforms | Positive |
| Spoken question answering | Prior models of comparable parameter count | MoST outperforms | Positive |
| Ablation: w/o modality-specific routing | Full MAMoE model | Ablated model degrades | Negative |
| Ablation: w/o shared experts | Full MAMoE model | Ablated model degrades | Negative |
Key Takeaways
- Modality-aware expert routing is a practical and effective design pattern for multimodal MoE LLMs — practitioners building speech-text or other multimodal systems should consider partitioning experts by modality rather than using fully shared parameters
- A two-stage training pipeline (modality alignment post-training → instruction fine-tuning) on fully open-source data is sufficient to achieve competitive multimodal performance, making strong speech-text LLMs more accessible to researchers without proprietary data access
- The fully open-source release (model weights, training/inference code, datasets) provides a strong reproducible baseline for the community to build upon for speech-text multimodal research and downstream applications
Abstract
We present MoST (Mixture of Speech and Text), a novel multimodal large language model that seamlessly integrates speech and text processing through our proposed Modality-Aware Mixture of Experts (MAMoE) architecture. While current multimodal models typically process diverse modality representations with identical parameters, disregarding their inherent representational differences, we introduce specialized routing pathways that direct tokens to modality-appropriate experts based on input type. MAMoE simultaneously enhances modality-specific learning and cross-modal understanding through two complementary components: modality-specific expert groups that capture domain-specific patterns and shared experts that facilitate information transfer between modalities. Building on this architecture, we develop an efficient transformation pipeline that adapts a pretrained MoE language model through strategic post-training on ASR and TTS datasets, followed by fine-tuning with a carefully curated speech-text instruction dataset. A key feature of this pipeline is that it relies exclusively on fully accessible, open-source datasets to achieve strong performance and data efficiency. Comprehensive evaluations across ASR, TTS, audio language modeling, and spoken question answering benchmarks show that MoST consistently outperforms existing models of comparable parameter counts. Our ablation studies confirm that the modality-specific routing mechanism and shared experts design significantly contribute to performance gains across all tested domains. To our knowledge, MoST represents the first fully open-source speech-text LLM built on a Mixture of Experts architecture. \footnote{We release the MoST model, training code, inference code, and training data at https://github.com/NUS-HPC-AI-Lab/MoST.}