DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment
Problem Statement
Existing Large Audio Language Models (LALMs) suffer from catastrophic forgetting of the LLM's original language abilities when fine-tuned on large-scale audio-instruction datasets. Current approaches rely on manually curated or LLM-synthesized datasets that introduce distributional mismatches, degrading instruction-following capabilities. There is a critical need for data construction strategies that align audio and text modalities without sacrificing the LLM's native language competence.
Key Novelty
- DeSTA self-generated cross-modal alignment: the backbone LLM generates its own training targets from audio metadata/captions, ensuring training distributions match the LLM's native output space and mitigating catastrophic forgetting
- DeSTA-AQA5M: a large-scale, task-agnostic dataset of 5 million training samples from 7,000 hours of audio across 50 diverse datasets spanning speech, environmental sounds, and music
- Zero-shot generalization without task-specific audio instruction-tuning, achieving state-of-the-art performance across diverse benchmarks through a single unified training regime
Evaluation Highlights
- State-of-the-art or competitive performance on Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench benchmarks covering auditory perception, reasoning, and instruction-following
- Comprehensive ablation studies demonstrating that self-generated data construction outperforms manually curated and LLM-synthesized alternatives on both auditory perception and instruction-following metrics
Methodology
- Collect 7,000 hours of diverse audio from 50 datasets (speech, environmental sounds, music) and pair with existing metadata, transcripts, and captions as seed information
- Use the backbone LLM itself to generate training question-answer targets from the seed text information (self-generated cross-modal alignment / DeSTA), constructing the DeSTA-AQA5M dataset of 5M task-agnostic samples
- Train an audio encoder + LLM architecture on DeSTA-AQA5M using the self-generated targets, then evaluate zero-shot on diverse audio-language benchmarks without any task-specific fine-tuning
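The data-construction step above can be sketched as a short loop. This is a minimal illustration under stated assumptions: `backbone_llm` is a hypothetical stand-in for a call to the actual backbone LLM, and the prompt format and field names are illustrative, not the paper's exact specification.

```python
# Sketch of the DeSTA self-generated data construction loop: the backbone LLM
# turns audio-associated text metadata into its own QA training targets.

def backbone_llm(prompt: str) -> str:
    """Placeholder for the backbone LLM (hypothetical stand-in)."""
    # In the real pipeline this would be a generate() call on the same LLM
    # that is later fine-tuned, so targets stay in its native output distribution.
    return f"[LLM answer conditioned on: {prompt[:40]}...]"

def build_training_sample(seed_metadata: dict, question: str) -> dict:
    """Turn one audio clip's metadata into an (audio, question, target) triple."""
    context = "; ".join(
        f"{k}: {v}" for k, v in seed_metadata.items() if k != "audio_path"
    )
    prompt = f"Audio description: {context}\nQuestion: {question}\nAnswer:"
    target = backbone_llm(prompt)  # self-generated target, not human-written
    return {
        "audio": seed_metadata["audio_path"],
        "question": question,
        "target": target,
    }

sample = build_training_sample(
    {"audio_path": "clip_0001.wav",
     "transcript": "thank you for calling",
     "emotion": "neutral"},
    "What does the speaker say, and how do they sound?",
)
```

The key design point is that the answer text comes from the same LLM that will later be trained, so the target distribution matches the model's native output space rather than a human annotator's or a different LLM's.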
System Components
- DeSTA data construction: the backbone LLM generates its own QA training targets from audio-associated text metadata, ensuring training outputs lie within the LLM's native distribution and preventing catastrophic forgetting
- DeSTA-AQA5M dataset: 5 million task-agnostic audio-QA samples built from 7,000 hours of audio across 50 datasets covering speech, environmental sounds, and music, designed for general-purpose LALM training
- Audio encoder: encodes raw audio into representations that are fed into the LLM, bridging the acoustic and language modalities
- Backbone LLM: serves dual roles, generating training targets during data construction and performing audio-grounded language understanding at inference time
- Evaluation suite: benchmarks including Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench, used to assess generalization without task-specific tuning
Results
| Benchmark | Prior SOTA / Baseline | DeSTA2.5-Audio | Delta |
|---|---|---|---|
| Dynamic-SUPERB | Competitive prior LALMs | State-of-the-art | Improvement |
| MMAU | Competitive prior LALMs | State-of-the-art or competitive | Improvement |
| SAKURA | Competitive prior LALMs | State-of-the-art or competitive | Improvement |
| Speech-IFEval | LALMs with task-specific tuning | State-of-the-art | Significant improvement |
| VoiceBench | Competitive prior LALMs | State-of-the-art or competitive | Improvement |
Key Takeaways
- Using the LLM itself to generate training targets (self-generated alignment) is a practical and effective technique to prevent catastrophic forgetting when extending LLMs to new modalities like audio
- Task-agnostic, diverse data construction at scale (DeSTA-AQA5M: 5M samples, 50 datasets) can enable strong zero-shot generalization, reducing the need for expensive task-specific audio instruction datasets
- Data construction strategy matters as much as model architecture: carefully aligning the training target distribution with the LLM's native output space yields better auditory perception and instruction-following than simply scaling up curated or synthetically augmented datasets
Abstract
We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM) designed for robust auditory perception and instruction-following, without requiring task-specific audio instruction-tuning. Recent LALMs typically augment Large Language Models (LLMs) with auditory capabilities by training on large-scale, manually curated or LLM-synthesized audio-instruction datasets. However, these approaches often suffer from catastrophic forgetting of the LLM's original language abilities. To address this, we revisit the data construction pipeline and propose DeSTA, a self-generated cross-modal alignment strategy in which the backbone LLM generates its own training targets. This approach preserves the LLM's native language proficiency while establishing effective audio-text alignment, thereby enabling zero-shot generalization without task-specific tuning. Using DeSTA, we construct DeSTA-AQA5M, a large-scale, task-agnostic dataset containing 5 million training samples derived from 7,000 hours of audio spanning 50 diverse datasets, including speech, environmental sounds, and music. DeSTA2.5-Audio achieves state-of-the-art or competitive performance across a wide range of audio-language benchmarks, including Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench. Comprehensive comparative studies demonstrate that our self-generated strategy outperforms widely adopted data construction and training strategies in both auditory perception and instruction-following capabilities. Our findings underscore the importance of carefully designed data construction in LALM development and offer practical insights for building robust, general-purpose LALMs.