Multilingual Dataset Integration Strategies for Robust Audio Deepfake Detection: A SAFE Challenge System

The SAFE Challenge evaluates synthetic speech detection across three tasks: unmodified audio, processed audio with compression artifacts, and laundered audio designed to evade detection. We systematically explore self-supervised learning (SSL) front-ends, training data compositions, and audio length configurations for robust deepfake detection. Our AASIST-based approach pairs a WavLM Large front-end with RawBoost augmentation and is trained on a multilingual dataset of 256,600 samples spanning 9 languages and over 70 TTS systems, drawn from CodecFake, MLAAD v5, SpoofCeleb, Famous Figures, and MAILABS. After comparing several SSL front-ends, three versions of the training data, and two audio lengths, our system placed second in both Task 1 (unmodified audio detection) and Task 3 (laundered audio detection), indicating that it generalizes well and remains robust to laundering.
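The audio length configurations mentioned above imply that every clip is brought to a fixed number of samples before it reaches the front-end. The summary does not state the two lengths or how short clips are handled, so the following is only a minimal, dependency-free sketch under assumed conventions: long clips are cropped (randomly when a random generator is supplied, otherwise from the start) and short clips are repeated until they fill the window. The function name `fix_length` and the example values are hypothetical and are not taken from the system described here.

```python
import random


def fix_length(samples, target_len, rng=None):
    """Return a list of exactly `target_len` samples.

    Assumed convention (not confirmed by the source):
    - clips longer than the target are cropped, at a random offset if
      `rng` is given (training) or from the start otherwise (evaluation);
    - clips shorter than the target are repeated and then truncated.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("cannot fix the length of an empty clip")
    if n >= target_len:
        # Valid start offsets are 0 .. n - target_len inclusive.
        start = 0 if rng is None else rng.randrange(n - target_len + 1)
        return list(samples[start:start + target_len])
    # Ceiling division: number of repetitions needed to cover the target.
    reps = -(-target_len // n)
    return (list(samples) * reps)[:target_len]


# Hypothetical usage: a 5-sample clip stretched to 12 samples.
padded = fix_length([0, 1, 2, 3, 4], 12)

# Hypothetical usage: a random 4-sample crop from a 10-sample clip.
cropped = fix_length(list(range(10)), 4, rng=random.Random(0))
```

In a real pipeline the same logic would normally run on arrays or tensors rather than Python lists; lists are used here only to keep the sketch self-contained.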

Key Contributions

Systematic empirical evaluation of multilingual dataset integration strategies (CodecFake, MLAAD v5, SpoofCeleb, Famous Figures, MAILABS) for training robust audio deepfake detectors
Comparison of SSL front-ends (WavLM Large and others), audio length configurations, and training data compositions across three SAFE Challenge tasks
Source-level vulnerability analysis revealing failure patterns for specific TTS systems and laundering techniques
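A source-level vulnerability analysis of the kind listed above can be illustrated by grouping detector decisions by the system that produced each clip and ranking the groups by error rate. The sketch below is an assumption about how such a breakdown might be computed, not the authors' procedure: the record layout `(source, label, score)`, the 0.5 threshold, and the source names in the usage example are all hypothetical.

```python
from collections import defaultdict


def per_source_error(records, threshold=0.5):
    """Compute the error rate for each source.

    `records` is an iterable of (source, label, score) tuples, where
    label is 1 for spoofed audio and 0 for bona fide audio, and score is
    the detector's spoof probability. This layout is an assumption made
    for illustration.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for source, label, score in records:
        predicted = 1 if score >= threshold else 0
        totals[source] += 1
        errors[source] += int(predicted != label)
    return {source: errors[source] / totals[source] for source in totals}


def rank_sources(rates):
    """Sort sources from most to least error-prone."""
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Hypothetical records; the source names are placeholders, not real results.
records = [
    ("tts_a", 1, 0.9),
    ("tts_a", 1, 0.2),  # spoofed clip missed by the detector
    ("tts_b", 1, 0.8),
    ("bona_fide", 0, 0.1),
]
rates = per_source_error(records)
ranking = rank_sources(rates)
```

Sources at the top of the ranking are the ones the detector handles worst, which is where additional training data or targeted augmentation would be examined first.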

🛡️ Threat Analysis

Output Integrity Attack

The work directly addresses the detection of AI-generated synthetic audio (audio deepfakes), including laundered audio designed to evade detection. This is a core output integrity and content authenticity problem under ML09.