Bona fide Cross Testing Reveals Weak Spot in Audio Deepfake Detection Systems
Chin Yuen Kwok, Jia Qi Yip, Zhen Qiu, Chi Hung Chi, Kwok Yan Lam
Published on arXiv
arXiv:2509.09204
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Bona fide speech diversity (acoustic environment and speech style) is a significant source of ADD failure that traditional single-bona-fide-type evaluation frameworks systematically miss.
bona fide cross-testing
Novel technique introduced
Audio deepfake detection (ADD) models are commonly evaluated using datasets that combine multiple synthesizers, with performance reported as a single Equal Error Rate (EER). However, this approach disproportionately weights synthesizers with more samples, underrepresenting others and reducing the overall reliability of EER. Additionally, most ADD datasets lack diversity in bona fide speech, often featuring a single environment and speech style (e.g., clean read speech), limiting their ability to simulate real-world conditions. To address these challenges, we propose bona fide cross-testing, a novel evaluation framework that incorporates diverse bona fide datasets and aggregates EERs for more balanced assessments. Our approach improves robustness and interpretability compared to traditional evaluation methods. We benchmark over 150 synthesizers across nine bona fide speech types and release a new dataset to facilitate further research at https://github.com/cyaaronk/audio_deepfake_eval.
Key Contributions
- Bona fide cross-testing framework that evaluates ADD models across K diverse bona fide speech types paired with M synthesizers, generating M×K EERs for balanced, interpretable assessment
- Demonstrates that combined-dataset EER is biased toward overrepresented synthesizers, concealing vulnerabilities from underrepresented subsets
- Benchmark of 150+ synthesizers across 9 bona fide speech types with released dataset and score files at https://github.com/cyaaronk/audio_deepfake_eval
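As a minimal sketch of the pooling bias the paper targets: the synthetic scores, sample counts, and the plain-mean aggregation below are illustrative assumptions (the paper's actual framework pairs each of M synthesizers with K bona fide types to get an M×K grid of EERs). A well-detected, heavily sampled synthesizer dominates the pooled EER and hides a poorly detected, underrepresented one; averaging per-synthesizer EERs surfaces it.

```python
import numpy as np

rng = np.random.default_rng(0)

def compute_eer(bona, spoof):
    """Equal Error Rate: operating point where the false-accept rate
    (spoof accepted as bona fide) equals the false-reject rate (bona
    fide rejected). Convention: higher score = more likely bona fide."""
    scores = np.concatenate([bona, spoof])
    labels = np.concatenate([np.ones(len(bona)), np.zeros(len(spoof))])
    labels = labels[np.argsort(scores)]
    frr = np.cumsum(labels) / len(bona)              # bona fide rejected below threshold
    far = 1.0 - np.cumsum(1 - labels) / len(spoof)   # spoof still accepted at/above it
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0

# Hypothetical detector scores: one bona fide type (for brevity) and
# two synthesizers with very different difficulty and sample counts.
bona       = rng.normal(1.0, 0.5, 2000)
spoof_easy = rng.normal(-2.0, 0.5, 9000)  # syn_A: well separated, many samples
spoof_hard = rng.normal(0.8, 0.5, 500)    # syn_B: overlaps bona fide, few samples

# Traditional combined-dataset evaluation: a single pooled EER.
pooled = compute_eer(bona, np.concatenate([spoof_easy, spoof_hard]))

# Cross-testing style: one EER per synthesizer, then aggregate.
per_syn = [compute_eer(bona, s) for s in (spoof_easy, spoof_hard)]
balanced = float(np.mean(per_syn))

print(f"pooled EER:   {pooled:.3f}")    # dominated by syn_A's easy samples
print(f"per-syn EERs: {[round(e, 3) for e in per_syn]}")
print(f"balanced EER: {balanced:.3f}")  # exposes syn_B as a weak spot
```

With these assumed score distributions, the pooled EER stays small because syn_B contributes only ~5% of spoof samples, while the balanced aggregate reveals that syn_B alone is near-undetectable.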
🛡️ Threat Analysis
Audio deepfake detection is AI-generated content detection — a primary ML09 concern. The paper proposes an evaluation framework that surfaces vulnerabilities in existing ADD models, directly contributing to the robustness of output integrity verification systems.