Why Speech Deepfake Detectors Won't Generalize: The Limits of Detection in an Open World
Visar Berisha, Prad Kadambi, Isabella Lenz
Published on arXiv: 2509.20405
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Newer synthesizers erase the legacy acoustic artifacts that detectors rely on, and conversational speech domains are consistently the hardest to secure, demonstrating that benchmark equal error rate (EER) systematically overstates real-world detector robustness.
Coverage Debt Analysis
Novel technique introduced
Speech deepfake detectors are often evaluated on clean, benchmark-style conditions, but deployment occurs in an open world of shifting devices, sampling rates, codecs, environments, and attack families. This creates a "coverage debt" for AI-based detectors: every new condition multiplies with existing ones, producing data blind spots that grow faster than data can be collected. Because attackers can target these uncovered regions, worst-case performance (not average benchmark scores) determines security. To demonstrate the impact of the coverage debt problem, we analyze results from a recent cross-testing framework. Grouping performance by bona fide domain and spoof release year, two patterns emerge: newer synthesizers erase the legacy artifacts detectors rely on, and conversational speech domains (teleconferencing, interviews, social media) are consistently the hardest to secure. These findings show that detection alone should not be relied upon for high-stakes decisions. Detectors should be treated as auxiliary signals within layered defenses that include provenance, personhood credentials, and policy safeguards.
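The multiplicative nature of coverage debt can be made concrete with a small sketch. The condition axes and their cardinalities below are illustrative assumptions, not figures from the paper; the point is only that each new axis multiplies, rather than adds to, the number of distinct conditions a detector must see in training.

```python
from math import prod

# Hypothetical condition axes for deployed speech (values are illustrative,
# not taken from the paper). Each axis multiplies the coverage grid.
condition_axes = {
    "device": 6,          # handset, laptop mic, headset, ...
    "sampling_rate": 3,   # e.g. 8, 16, 44.1 kHz
    "codec": 5,           # Opus, AMR, MP3, ...
    "environment": 4,     # quiet, street, car, reverberant room
    "attack_family": 8,   # TTS systems, voice conversion, vocoders, ...
}

def coverage_cells(axes):
    """Number of distinct condition combinations the detector must cover."""
    return prod(axes.values())

print(coverage_cells(condition_axes))  # 6*3*5*4*8 = 2880 cells

# Adding a single new axis (say, 4 languages) multiplies the grid again,
# while data collection effort typically grows only additively:
condition_axes["language"] = 4
print(coverage_cells(condition_axes))  # 2880*4 = 11520 cells
```

Even with generous per-cell sampling, the grid outpaces any realistic collection budget, which is the "blind spots grow faster than data" claim in the abstract.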
Key Contributions
- Introduces 'coverage debt' — the multiplicative, combinatorially unbounded growth of required training conditions versus what can be collected — as a fundamental barrier to speech deepfake detector generalization.
- Demonstrates through cross-testing analysis that newer synthesizers eliminate legacy artifacts detectors rely on, and that conversational speech domains (teleconferencing, interviews, social media) are systematically the hardest to cover.
- Argues that worst-case performance (not average benchmark EER) determines security and that detectors should be treated as auxiliary signals within layered defenses rather than primary gatekeepers.
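The worst-case vs. average argument in the last contribution can be illustrated numerically. The per-domain EERs below are made-up values chosen to mirror the paper's qualitative pattern (benchmark-style domains strong, conversational domains near chance), not reported results.

```python
# Illustrative (made-up) per-domain equal error rates for one detector.
# An EER of 0.50 is chance performance for this binary decision.
domain_eer = {
    "studio_read_speech": 0.02,
    "audiobook": 0.03,
    "broadcast_news": 0.05,
    "teleconferencing": 0.38,
    "social_media_clips": 0.45,
}

average_eer = sum(domain_eer.values()) / len(domain_eer)
worst_case_eer = max(domain_eer.values())

print(f"average EER:    {average_eer:.3f}")   # dominated by easy domains
print(f"worst-case EER: {worst_case_eer:.3f}")  # what an attacker targets
```

An adversary who can choose the condition (here, posting spoofed audio as a social media clip) experiences the worst-case row, so the averaged benchmark number says little about the security the detector actually provides.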
🛡️ Threat Analysis
Directly analyzes the robustness and failure modes of speech deepfake detectors, framing coverage gaps as adversarially exploitable attack surfaces: a core concern for output integrity and AI-generated content detection.