Speech DF Arena: A Leaderboard for Speech DeepFake Detection Models
Sandipana Dowerah 1, Atharva Kulkarni 2, Ajinkya Kulkarni 3, Hoan My Tran 4, Joonas Kalda 1, Artem Fedorchenko 1, Benoit Fauve 5, Damien Lolive 6, Tanel Alumäe 1, Matthew Magimai Doss 3
Published on arXiv: 2509.02859
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Most evaluated SOTA systems exhibit high equal error rates in out-of-domain scenarios, revealing that current audio deepfake detectors lack robust cross-domain generalization.
Speech DF Arena
Novel technique introduced
Parallel to the development of advanced deepfake audio generation, audio deepfake detection has also seen significant progress. However, a standardized and comprehensive benchmark is still missing. To address this, we introduce Speech DeepFake (DF) Arena, the first comprehensive benchmark for audio deepfake detection. Speech DF Arena provides a toolkit for uniformly evaluating detection systems, currently across 14 diverse datasets and attack scenarios, along with standardized evaluation metrics and protocols for reproducibility and transparency. It also includes a leaderboard that compares and ranks systems, helping researchers and developers improve their reliability and robustness. We include 14 evaluation sets, 12 state-of-the-art open-source detection systems, and 3 proprietary ones. Our study finds that many systems exhibit high EER in out-of-domain scenarios, highlighting the need for extensive cross-domain evaluation. The leaderboard is hosted on Hugging Face, and a toolkit for reproducing results across the listed datasets is available on GitHub.
Key Contributions
- First comprehensive benchmark (Speech DF Arena) evaluating audio deepfake detection across 14 diverse datasets with standardized metrics (EER, pooled EER, accuracy, F1) for reproducibility
- Hosted leaderboard on HuggingFace covering 12 open-source and 3 proprietary SOTA detection systems with a reproducible evaluation toolkit on GitHub
- Empirical finding that most SOTA detectors exhibit high EER in out-of-domain scenarios, exposing a critical generalization gap between lab benchmarks and real-world conditions
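The benchmark's headline metric, equal error rate (EER), is the operating point where the false-acceptance rate (spoof accepted as bona fide) equals the false-rejection rate (bona fide rejected as spoof). The paper does not publish its scoring code here, so the following is a minimal illustrative sketch of how EER is conventionally computed from detector scores; function and variable names are my own, not the toolkit's API.

```python
import numpy as np

def compute_eer(scores, labels):
    """Equal error rate of a binary spoof detector.

    scores: higher score = more likely bona fide (genuine) speech.
    labels: 1 for bona fide, 0 for spoofed/deepfake audio.
    Returns the rate at the threshold where false-accept
    and false-reject rates are closest to equal.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    thresholds = np.sort(np.unique(scores))
    fars, frrs = [], []
    for t in thresholds:
        accept = scores >= t                      # accept as bona fide
        fars.append(np.mean(accept[labels == 0])) # spoof accepted (FAR)
        frrs.append(np.mean(~accept[labels == 1]))# bona fide rejected (FRR)
    fars, frrs = np.array(fars), np.array(frrs)
    idx = np.argmin(np.abs(fars - frrs))          # crossing point
    return (fars[idx] + frrs[idx]) / 2
```

A perfectly separating detector yields an EER of 0.0; a chance-level one yields roughly 0.5 (50%), which is why high out-of-domain EERs in the study signal a generalization failure rather than mere degradation.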
🛡️ Threat Analysis
Audio deepfake detection maps canonically to ML09 (Output Integrity Attack — detecting AI-generated content). This paper establishes a standardized evaluation benchmark for exactly this threat, measuring how well detection systems identify synthetic or voice-converted speech across diverse attack scenarios and datasets.