benchmark · arXiv · Sep 2, 2025
Sandipana Dowerah, Atharva Kulkarni, Ajinkya Kulkarni et al. · Tallinn University of Technology · MBZUAI +4 more
Benchmarks 15 audio deepfake detectors across 14 datasets, exposing severe cross-domain generalization failures
Output Integrity Attack · audio
Parallel to the development of advanced deepfake audio generation, audio deepfake detection has also seen significant progress. However, a standardized and comprehensive benchmark is still missing. To address this, we introduce Speech DeepFake (DF) Arena, the first comprehensive benchmark for audio deepfake detection. Speech DF Arena provides a toolkit to uniformly evaluate detection systems, currently across 14 diverse datasets and attack scenarios, with standardized evaluation metrics and protocols for reproducibility and transparency. It also includes a leaderboard to compare and rank systems, helping researchers and developers enhance their reliability and robustness. We include 14 evaluation sets and 15 detection systems: 12 state-of-the-art open-source and 3 proprietary. Our study shows that many systems exhibit high equal error rates (EER) in out-of-domain scenarios, highlighting the need for extensive cross-domain evaluation. The leaderboard is hosted on Hugging Face, and a toolkit for reproducing results across the listed datasets is available on GitHub.
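The benchmark ranks detectors by EER, the operating point where the false-acceptance and false-rejection rates coincide. As a rough illustration of the metric (a minimal sketch, not the DF Arena toolkit; the `compute_eer` helper and the bonafide-high/spoof-low score convention are assumptions):

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Equal Error Rate: the point where the false-rejection rate on
    bonafide audio equals the false-acceptance rate on spoofed audio.
    Hypothetical helper; assumes higher scores mean 'more bonafide'."""
    scores = np.concatenate([bonafide_scores, spoof_scores])
    labels = np.concatenate([np.ones_like(bonafide_scores),
                             np.zeros_like(spoof_scores)])
    order = np.argsort(scores)          # sweep thresholds in score order
    labels = labels[order]
    # FRR: fraction of bonafide at or below the current threshold
    frr = np.cumsum(labels) / labels.sum()
    # FAR: fraction of spoof strictly above the current threshold
    far = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()
    idx = np.argmin(np.abs(frr - far))  # closest crossing of the two curves
    return float((frr[idx] + far[idx]) / 2)

# Perfectly separated toy scores yield an EER of 0
print(compute_eer(np.array([0.9, 0.8, 0.7]), np.array([0.1, 0.2, 0.3])))
```

A detector can look strong in-domain (near-zero EER) and still collapse out-of-domain, which is why the leaderboard reports the metric per dataset rather than as a single aggregate.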
benchmark · arXiv · Mar 6, 2026
Ajinkya Kulkarni, Sandipana Dowerah, Atharva Kulkarni et al. · Idiap Research Institute · Tallinn University of Technology +1 more
Controlled study benchmarking compact SSL backbones for audio deepfake detection with TTA-based uncertainty calibration
Output Integrity Attack · audio
Self-supervised learning (SSL) underpins modern audio deepfake detection, yet most prior work centers on a single large wav2vec2-XLSR backbone, leaving compact models understudied. We present RAPTOR (Representation-Aware Pairwise-gated Transformer for Out-of-domain Recognition), a controlled study of compact SSL backbones from the HuBERT and WavLM families within a unified pairwise-gated fusion detector, evaluated across 14 cross-domain benchmarks. We show that multilingual HuBERT pre-training is the primary driver of cross-domain robustness, enabling ~100M-parameter models to match larger and commercial systems. Beyond EER, we introduce a test-time augmentation protocol with perturbation-based aleatoric uncertainty to expose calibration differences invisible to standard metrics: WavLM variants exhibit overconfident miscalibration under perturbation, whereas iterative mHuBERT remains stable. These findings indicate that the SSL pre-training trajectory, not model scale, drives reliable audio deepfake detection.
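The general shape of a perturbation-based test-time-augmentation protocol is to score several perturbed copies of each input and treat the spread of scores as an uncertainty proxy: a well-calibrated detector should give stable scores under small input perturbations. A minimal sketch (the `tta_uncertainty` helper, the additive-noise perturbation, and the std-as-uncertainty choice are assumptions, not RAPTOR's actual protocol):

```python
import numpy as np

def tta_uncertainty(predict, waveform, n_aug=8, noise_std=0.005, seed=0):
    """Test-time augmentation sketch: score n_aug noise-perturbed copies
    of the waveform and return (mean score, score std). The std serves
    as a perturbation-based uncertainty proxy; a model whose scores swing
    widely under small perturbations is a calibration red flag."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_aug):
        perturbed = waveform + rng.normal(0.0, noise_std, size=waveform.shape)
        scores.append(float(predict(perturbed)))
    scores = np.asarray(scores)
    return scores.mean(), scores.std()

# Toy scorer standing in for a detector: mean amplitude of the clip
predict = lambda w: w.mean()
mean_score, spread = tta_uncertainty(predict, np.zeros(16000))
```

Comparing this spread across backbones, rather than only the point-estimate EER, is what separates an overconfident model from a stably calibrated one under identical accuracy.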