AUDETER: A Large-scale Dataset for Deepfake Audio Detection in Open Worlds
Qizhou Wang 1, Hanxun Huang 1, Guansong Pang 2, Sarah Erfani 1, Christopher Leckie 1
Published on arXiv (arXiv:2509.04345)
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
XLR-based detectors trained on AUDETER achieve 1.87% EER on the In-the-Wild benchmark, demonstrating strong cross-domain generalisation compared to existing detectors trained on prior datasets.
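EER (equal error rate) is the operating point where the false rejection rate on deepfakes equals the false acceptance rate on real speech; lower is better. A minimal sketch of computing EER from detector scores (not code from the paper; score convention and variable names are assumptions):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER from raw detector scores.
    scores: higher = more likely deepfake; labels: 1 = deepfake, 0 = real.
    Sweeps thresholds from high to low and returns the point where the
    false negative rate (missed deepfakes) meets the false positive rate
    (real speech flagged as fake)."""
    order = np.argsort(-scores)          # thresholds from strictest to loosest
    labels = np.asarray(labels)[order]
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    tp = np.cumsum(labels)               # deepfakes caught at each threshold
    fp = np.cumsum(1 - labels)           # real clips wrongly flagged
    fnr = 1 - tp / n_pos
    fpr = fp / n_neg
    i = np.argmin(np.abs(fnr - fpr))     # closest crossing point
    return (fnr[i] + fpr[i]) / 2
```

On a perfectly separable toy set this returns 0.0; the 1.87% figure above means the crossing point sits at 0.0187.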
AUDETER
Novel technique introduced
Speech synthesis systems can now produce highly realistic vocalisations that pose significant authenticity challenges. Despite substantial progress in deepfake detection models, their real-world effectiveness is often undermined by evolving distribution shifts between training and test data, driven by the complexity of human speech and the rapid evolution of synthesis systems. Existing datasets suffer from limited real speech diversity, insufficient coverage of recent synthesis systems, and heterogeneous mixtures of deepfake sources, which hinder systematic evaluation and open-world model training. To address these issues, we introduce AUDETER (AUdio DEepfake TEst Range), a large-scale and highly diverse deepfake audio dataset comprising over 4,500 hours of synthetic audio generated by 11 recent TTS models and 10 vocoders, totalling 3 million clips. We further observe that most existing detectors default to binary supervised training, which can induce negative transfer across synthesis sources when the training data contains highly diverse deepfake patterns, impacting overall generalisation. As a complementary contribution, we propose an effective curriculum-learning-based approach to mitigate this effect. Extensive experiments show that existing detection models struggle to generalise to novel deepfakes and human speech in AUDETER, whereas XLR-based detectors trained on AUDETER achieve strong cross-domain performance across multiple benchmarks, achieving an EER of 1.87% on In-the-Wild. AUDETER is available on GitHub.
Key Contributions
- AUDETER: a large-scale deepfake audio dataset of 4,500+ hours and 3M clips synthesised by 11 TTS models and 10 vocoders, addressing gaps in real speech diversity and synthesis system coverage
- Empirical finding that binary supervised training on heterogeneous deepfake sources induces negative transfer, degrading generalisation
- Curriculum-learning-based training strategy to mitigate negative transfer, enabling XLR-based detectors to achieve 1.87% EER on the In-the-Wild benchmark
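The paper's exact curriculum is not reproduced here; the idea is to avoid negative transfer by not exposing the detector to all heterogeneous synthesis sources at once. A generic sketch of stage-wise source unlocking, assuming sources are pre-ranked from easiest to hardest to detect (the ranking criterion and schedule are illustrative assumptions, not the authors' method):

```python
def curriculum_pool(sources, epoch, epochs_per_stage=2):
    """Return the synthesis sources active at a given training epoch.
    sources: list of source identifiers, assumed sorted easy -> hard
    (e.g. by a proxy such as a baseline detector's per-source accuracy).
    One additional source is unlocked every `epochs_per_stage` epochs,
    so early training sees homogeneous data before harder, more diverse
    sources are mixed in."""
    unlocked = 1 + epoch // epochs_per_stage
    return sources[:min(unlocked, len(sources))]

# Example: three synthesis sources unlocked over training
sources = ["vocoder_A", "tts_B", "tts_C"]
schedule = [curriculum_pool(sources, e) for e in range(6)]
```

Each epoch's minibatches would then be drawn only from the returned pool, alongside real speech.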
🛡️ Threat Analysis
Directly addresses detection of AI-generated audio content: building a benchmark for evaluating deepfake audio detectors and improving their generalisation falls squarely under output integrity and authenticity of AI-generated content.