DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment
Problem Statement
Existing Large Audio Language Models (LALMs) suffer from catastrophic forgetting of the LLM's original language abilities when fine-tuned on large-scale audio-instruction datasets. Current approaches rely on manually curated or LLM-synthesized datasets that introduce distributional mismatches, degrading instruction-following capabilities. There is a critical need for data construction strategies that align audio and text modalities without sacrificing the LLM's native language competence.
Key Novelty
- DeSTA self-generated cross-modal alignment: the backbone LLM generates its own training targets from audio metadata/captions, ensuring training distributions match the LLM's native output space and mitigating catastrophic forgetting
- DeSTA-AQA5M: a large-scale, task-agnostic dataset of 5 million training samples from 7,000 hours of audio across 50 diverse datasets spanning speech, environmental sounds, and music
- Zero-shot generalization without task-specific audio instruction-tuning, achieving state-of-the-art performance across diverse benchmarks through a single unified training regime
Evaluation Highlights
- State-of-the-art or competitive performance on Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench benchmarks covering auditory perception, reasoning, and instruction-following
- Comprehensive ablation studies demonstrating that self-generated data construction outperforms manually curated and LLM-synthesized alternatives on both auditory perception and instruction-following metrics
Methodology
- Collect 7,000 hours of diverse audio from 50 datasets (speech, environmental sounds, music) and pair with existing metadata, transcripts, and captions as seed information
- Use the backbone LLM itself to generate training question-answer targets from the seed text information (self-generated cross-modal alignment / DeSTA), constructing the DeSTA-AQA5M dataset of 5M task-agnostic samples
- Train an audio encoder + LLM architecture on DeSTA-AQA5M using the self-generated targets, then evaluate zero-shot on diverse audio-language benchmarks without any task-specific fine-tuning
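The data-construction step above can be sketched as a short loop. This is a minimal illustration under stated assumptions: `backbone_llm` is a hypothetical stand-in for a call to the actual backbone LLM, and the prompt format and field names are illustrative, not the paper's exact specification.

```python
# Sketch of the DeSTA self-generated data construction loop: the backbone LLM
# turns audio-associated text metadata into its own QA training targets.

def backbone_llm(prompt: str) -> str:
    """Placeholder for the backbone LLM (hypothetical stand-in)."""
    # In the real pipeline this would be a generate() call on the same LLM
    # that is later fine-tuned, so targets stay in its native output distribution.
    return f"[LLM answer conditioned on: {prompt[:40]}...]"

def build_training_sample(seed_metadata: dict, question: str) -> dict:
    """Turn one audio clip's metadata into an (audio, question, target) triple."""
    context = "; ".join(
        f"{k}: {v}" for k, v in seed_metadata.items() if k != "audio_path"
    )
    prompt = f"Audio description: {context}\nQuestion: {question}\nAnswer:"
    target = backbone_llm(prompt)  # self-generated target, not human-written
    return {
        "audio": seed_metadata["audio_path"],
        "question": question,
        "target": target,
    }

sample = build_training_sample(
    {"audio_path": "clip_0001.wav",
     "transcript": "thank you for calling",
     "emotion": "neutral"},
    "What does the speaker say, and how do they sound?",
)
```

The key design point is that the answer text comes from the same LLM that will later be trained, so the target distribution matches the model's native output space rather than a human annotator's or a different LLM's.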
System Components
- DeSTA data construction: the backbone LLM generates its own QA training targets from audio-associated text metadata, ensuring training outputs lie within the LLM's native distribution and preventing catastrophic forgetting
- DeSTA-AQA5M dataset: 5 million task-agnostic audio-QA samples built from 7,000 hours of audio across 50 datasets covering speech, environmental sounds, and music, designed for general-purpose LALM training
- Audio encoder: encodes raw audio into representations that are fed into the LLM, bridging the acoustic and language modalities
- Backbone LLM: serves dual roles, generating training targets during data construction and performing audio-grounded language understanding at inference time
- Evaluation suite: benchmarks including Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench, used to assess generalization without task-specific tuning
Results
| Benchmark | Prior SOTA / Baseline | DeSTA2.5-Audio | Delta |
|---|---|---|---|
| Dynamic-SUPERB | Competitive prior LALMs | State-of-the-art | Improvement |
| MMAU | Competitive prior LALMs | State-of-the-art or competitive | Improvement |
| SAKURA | Competitive prior LALMs | State-of-the-art or competitive | Improvement |
| Speech-IFEval | LALMs with task-specific tuning | State-of-the-art | Significant improvement |
| VoiceBench | Competitive prior LALMs | State-of-the-art or competitive | Improvement |
Key Takeaways
- Using the LLM itself to generate training targets (self-generated alignment) is a practical and effective technique to prevent catastrophic forgetting when extending LLMs to new modalities like audio
- Task-agnostic, diverse data construction at scale (DeSTA-AQA5M: 5M samples, 50 datasets) can enable strong zero-shot generalization, reducing the need for expensive task-specific audio instruction datasets
- Data construction strategy matters as much as model architecture: carefully aligning the training target distribution with the LLM's native output space yields better auditory perception and instruction-following than simply scaling up curated or synthetically augmented datasets
Abstract
We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM) designed for robust auditory perception and instruction-following, without requiring task-specific audio instruction-tuning. Recent LALMs typically augment Large Language Models (LLMs) with auditory capabilities by training on large-scale, manually curated or LLM-synthesized audio-instruction datasets. However, these approaches often suffer from catastrophic forgetting of the LLM's original language abilities. To address this, we revisit the data construction pipeline and propose DeSTA, a self-generated cross-modal alignment strategy in which the backbone LLM generates its own training targets. This approach preserves the LLM's native language proficiency while establishing effective audio-text alignment, thereby enabling zero-shot generalization without task-specific tuning. Using DeSTA, we construct DeSTA-AQA5M, a large-scale, task-agnostic dataset containing 5 million training samples derived from 7,000 hours of audio spanning 50 diverse datasets, including speech, environmental sounds, and music. DeSTA2.5-Audio achieves state-of-the-art or competitive performance across a wide range of audio-language benchmarks, including Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench. Comprehensive comparative studies demonstrate that our self-generated strategy outperforms widely adopted data construction and training strategies in both auditory perception and instruction-following capabilities. Our findings underscore the importance of carefully designed data construction in LALM development and offer practical insights for building robust, general-purpose LALMs.