
DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment

Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Sung-Feng Huang, Chih-Kai Yang, Chee-En Yu, C. Chen, Wei-Chih Chen, Chien-yu Huang, Yi-Cheng Lin, Yu-Xiang Lin, C. Fu, Chun-Yi Kuan, Wenze Ren, Xuanjun Chen, Wei-Ping Huang, En-Pei Hu, Tzu-Quan Lin, Yuan-Kuei Wu, Kuan-Po Huang, Hsiao-Ying Huang, Huang-Cheng Chou, Kai-Wei Chang, Cheng-Han Chiang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee
arXiv.org | 2025
DeSTA2.5-Audio introduces a self-generated cross-modal alignment strategy (DeSTA) where the backbone LLM generates its own training targets, enabling a general-purpose Large Audio Language Model that preserves language proficiency while achieving strong auditory understanding without task-specific tuning.

Problem Statement

Existing Large Audio Language Models (LALMs) suffer from catastrophic forgetting of the LLM's original language abilities when fine-tuned on large-scale audio-instruction datasets. Current approaches rely on manually curated or LLM-synthesized datasets that introduce distributional mismatches, degrading instruction-following capabilities. There is a critical need for data construction strategies that align audio and text modalities without sacrificing the LLM's native language competence.

Key Novelty

  • DeSTA self-generated cross-modal alignment: the backbone LLM generates its own training targets from audio metadata/captions, ensuring training distributions match the LLM's native output space and mitigating catastrophic forgetting
  • DeSTA-AQA5M: a large-scale, task-agnostic dataset of 5 million training samples from 7,000 hours of audio across 50 diverse datasets spanning speech, environmental sounds, and music
  • Zero-shot generalization without task-specific audio instruction-tuning, achieving state-of-the-art performance across diverse benchmarks through a single unified training regime

Evaluation Highlights

  • State-of-the-art or competitive performance on Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench benchmarks covering auditory perception, reasoning, and instruction-following
  • Comprehensive ablation studies demonstrating that self-generated data construction outperforms manually curated and LLM-synthesized data strategies on both auditory perception and instruction-following metrics

Breakthrough Assessment

7/10 The self-generated alignment strategy is a principled and practically impactful solution to catastrophic forgetting in multimodal LLM training, with strong empirical validation across diverse benchmarks and a large released dataset; however, the core idea of LLM-generated training targets is an extension of known self-distillation concepts rather than a fundamental paradigm shift.

Methodology

  1. Collect 7,000 hours of diverse audio from 50 datasets (speech, environmental sounds, music) and pair with existing metadata, transcripts, and captions as seed information
  2. Use the backbone LLM itself to generate training question-answer targets from the seed text information (self-generated cross-modal alignment / DeSTA), constructing the DeSTA-AQA5M dataset of 5M task-agnostic samples
  3. Train an audio encoder + LLM architecture on DeSTA-AQA5M using the self-generated targets, then evaluate zero-shot on diverse audio-language benchmarks without any task-specific fine-tuning
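The pipeline above (in particular step 2) can be sketched as follows. This is a minimal illustration, not the paper's released code: `call_backbone_llm` is a hypothetical stand-in for querying the backbone LLM, stubbed here so the example runs, and the metadata fields and prompt format are assumptions.

```python
# Sketch of DeSTA-style self-generated target construction (step 2 above).

def call_backbone_llm(prompt: str) -> str:
    """Stub for the backbone LLM. A real pipeline queries the same LLM that
    will later be trained, so generated targets stay inside its native
    output distribution."""
    return "The speaker calmly asks about the weather."

def build_training_sample(audio_path: str, metadata: dict, instruction: str) -> dict:
    # Serialize the seed information (transcript, captions, tags) as text.
    seed = "; ".join(f"{k}: {v}" for k, v in metadata.items())
    # The backbone LLM answers the instruction from the seed text alone;
    # its answer becomes the training target paired with the raw audio.
    prompt = f"[Audio description] {seed}\n[Instruction] {instruction}"
    target = call_backbone_llm(prompt)
    return {"audio": audio_path, "instruction": instruction, "target": target}

sample = build_training_sample(
    "clip_0001.wav",
    {"transcript": "what's the weather like", "emotion": "neutral"},
    "Describe what you hear.",
)
print(sample["target"])
```

Because the target is produced by the same model that is later trained, the training distribution matches the LLM's own output space, which is the mechanism the paper credits for mitigating catastrophic forgetting.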

System Components

DeSTA (Self-Generated Cross-Modal Alignment)

A data construction strategy where the backbone LLM generates its own QA training targets from audio-associated text metadata, ensuring training outputs lie within the LLM's native distribution and preventing catastrophic forgetting

DeSTA-AQA5M Dataset

A 5-million sample, task-agnostic audio-QA dataset built from 7,000 hours of audio across 50 datasets covering speech, environmental sounds, and music, designed for general-purpose LALM training

Audio Encoder

Encodes raw audio into representations that are fed into the LLM, bridging the acoustic and language modalities
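A common way such bridging works, sketched here as an illustration rather than the paper's actual architecture code, is a linear projection that maps encoder frames into the LLM's embedding dimension, with the projected frames prepended to the text-token embeddings. The dimensions and the projector are hypothetical; plain Python lists stand in for tensors.

```python
# Illustrative bridging: project audio-encoder frames into the LLM's
# embedding space and prepend them to the text-token embeddings.

def project(frames, weight):
    """Multiply each frame (length d_audio) by weight (d_audio x d_llm)."""
    d_llm = len(weight[0])
    return [
        [sum(f[i] * weight[i][j] for i in range(len(f))) for j in range(d_llm)]
        for f in frames
    ]

d_audio, d_llm = 4, 6
audio_frames = [[0.1] * d_audio for _ in range(3)]   # 3 encoder output frames
weight = [[0.5] * d_llm for _ in range(d_audio)]     # hypothetical projector
text_embeddings = [[0.0] * d_llm for _ in range(5)]  # 5 text-token embeddings

# The LLM then attends over audio and text positions in one sequence.
llm_input = project(audio_frames, weight) + text_embeddings
print(len(llm_input))  # 3 audio + 5 text positions = 8
```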

Backbone LLM

The large language model that serves dual roles: generating training targets during data construction and performing audio-grounded language understanding at inference time

Zero-Shot Evaluation Framework

Benchmarks including Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench used to assess generalization without task-specific tuning

Results

Benchmark        Prior SOTA / Baseline            DeSTA2.5-Audio                   Delta
Dynamic-SUPERB   Competitive prior LALMs          State-of-the-art                 Improvement
MMAU             Competitive prior LALMs          State-of-the-art or competitive  Improvement
SAKURA           Competitive prior LALMs          State-of-the-art or competitive  Improvement
Speech-IFEval    LALMs with task-specific tuning  State-of-the-art                 Significant improvement
VoiceBench       Competitive prior LALMs          State-of-the-art or competitive  Improvement

Key Takeaways

  • Using the LLM itself to generate training targets (self-generated alignment) is a practical and effective technique to prevent catastrophic forgetting when extending LLMs to new modalities like audio
  • Task-agnostic, diverse data construction at scale (DeSTA-AQA5M: 5M samples, 50 datasets) can enable strong zero-shot generalization, reducing the need for expensive task-specific audio instruction datasets
  • Data construction strategy matters as much as model architecture: carefully aligning the training target distribution with the LLM's native output space yields better auditory perception and instruction-following than simply scaling up curated or synthetically augmented datasets

Abstract

We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM) designed for robust auditory perception and instruction-following, without requiring task-specific audio instruction-tuning. Recent LALMs typically augment Large Language Models (LLMs) with auditory capabilities by training on large-scale, manually curated or LLM-synthesized audio-instruction datasets. However, these approaches have often suffered from the catastrophic forgetting of the LLM's original language abilities. To address this, we revisit the data construction pipeline and propose DeSTA, a self-generated cross-modal alignment strategy in which the backbone LLM generates its own training targets. This approach preserves the LLM's native language proficiency while establishing effective audio-text alignment, thereby enabling zero-shot generalization without task-specific tuning. Using DeSTA, we construct DeSTA-AQA5M, a large-scale, task-agnostic dataset containing 5 million training samples derived from 7,000 hours of audio spanning 50 diverse datasets, including speech, environmental sounds, and music. DeSTA2.5-Audio achieves state-of-the-art or competitive performance across a wide range of audio-language benchmarks, including Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench. Comprehensive comparative studies demonstrate that our self-generated strategy outperforms widely adopted data construction and training strategies in both auditory perception and instruction-following capabilities. Our findings underscore the importance of carefully designed data construction in LALM development and offer practical insights for building robust, general-purpose LALMs.

Generated on 2026-03-03 using Claude