AUDDT: Audio Unified Deepfake Detection Benchmark Toolkit
Yi Zhu, Heitor R. Guimarães, Arthur Pimentel, Tiago Falk
Published on arXiv: 2509.21597
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
A baseline detector pretrained on ASVspoof2019 exhibits dramatic performance variance across 28 datasets, with accuracy dropping to near-chance levels on several real-world-condition datasets, exposing critical generalization failures.
AUDDT
Novel technique introduced
With the prevalence of artificial intelligence (AI)-generated content such as audio deepfakes, a large body of recent work has focused on developing deepfake detection techniques. However, most models are evaluated on a narrow set of datasets, leaving their generalization to real-world conditions uncertain. In this paper, we systematically review 28 existing audio deepfake datasets and present an open-source benchmarking toolkit called AUDDT (https://github.com/MuSAELab/AUDDT). The toolkit automates the evaluation of pretrained detectors across these 28 datasets, giving users direct feedback on the strengths and shortcomings of their deepfake detectors. We first showcase the toolkit's usage, the composition of our benchmark, and the breakdown of different deepfake subgroups. Next, using a widely adopted pretrained deepfake detector, we present in- and out-of-domain detection results, revealing notable differences across conditions and audio manipulation types. Lastly, we analyze the limitations of these existing datasets and their gap relative to practical deployment scenarios.
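The evaluation pattern the abstract describes (one pretrained detector scored across many datasets) can be sketched as a minimal loop. This is a hedged illustration only: the `evaluate` function, the toy `datasets` dict, and the `threshold_detector` are stand-ins invented for this sketch, not AUDDT's actual API.

```python
# Hypothetical sketch of the evaluation loop a toolkit like AUDDT automates:
# run one pretrained detector over many datasets and collect per-dataset accuracy.
# The detector callable and dataset contents are stand-ins, not AUDDT's interface.

def evaluate(detector, datasets):
    """Return {dataset_name: accuracy} for a binary detector.

    `detector(clip)` is assumed to return 1 for bona fide audio, 0 for fake.
    """
    results = {}
    for name, samples in datasets.items():  # samples: list of (clip, label)
        correct = sum(detector(clip) == label for clip, label in samples)
        results[name] = correct / len(samples)
    return results

# Toy stand-in data: "clips" are floats a simple threshold detector can score.
datasets = {
    "in_domain":  [(0.9, 1), (0.1, 0), (0.8, 1), (0.2, 0)],
    "out_domain": [(0.6, 1), (0.4, 0), (0.3, 1), (0.7, 0)],
}
threshold_detector = lambda clip: int(clip >= 0.5)
scores = evaluate(threshold_detector, datasets)
print(scores)  # in_domain separates cleanly; out_domain drops to chance level
```

The per-dataset breakdown, rather than a single pooled number, is what exposes the in- vs. out-of-domain gap the paper reports.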
Key Contributions
- Systematic review and taxonomy of 28 audio deepfake datasets covering diverse generation methods, languages, perturbations, and recording conditions
- AUDDT open-source toolkit that automates evaluation of any pretrained audio deepfake detector across all 28 datasets with minimal user effort
- Empirical analysis using a baseline ASVspoof2019-pretrained detector revealing large performance variance across deepfake subgroups and real-world deployment gaps
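Benchmarks like this typically report the equal error rate (EER), the operating point where false-acceptance and false-rejection rates meet; an EER near 50% corresponds to the near-chance behavior noted in the key finding. The paper does not show its metric code, so the following is a self-contained sketch of a standard EER approximation by threshold sweep.

```python
# Equal Error Rate (EER) sketch: sweep thresholds over the observed scores and
# report the point where false-rejection and false-acceptance rates are closest.
def eer(bona_scores, spoof_scores):
    """Approximate EER for a detector whose higher scores mean 'bona fide'
    (a convention assumed here, not taken from the paper)."""
    best_gap, best_rate = 1.0, 0.5
    for t in sorted(set(bona_scores) | set(spoof_scores)):
        frr = sum(s < t for s in bona_scores) / len(bona_scores)    # bona fide rejected
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)  # spoof accepted
        if abs(far - frr) < best_gap:
            best_gap, best_rate = abs(far - frr), (far + frr) / 2
    return best_rate

# Perfectly separated scores give EER 0.0; fully overlapping scores approach 0.5.
print(eer([0.9, 0.8, 0.7, 0.6], [0.1, 0.2, 0.3, 0.4]))  # 0.0
```

A finer-grained sweep (or interpolation between adjacent thresholds) would tighten the estimate; sweeping only the observed scores keeps the sketch short.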
🛡️ Threat Analysis
Audio deepfake detection is a core ML09 concern (AI-generated content detection / output integrity). The toolkit systematically evaluates detectors' ability to identify AI-generated audio across diverse generation methods (diffusion, neural codec, vocoders), directly measuring output integrity assurance under real-world conditions.