Audio Deepfake Detection in the Age of Advanced Text-to-Speech Models
Robin Singh, Aditya Yogesh Nair, Fabio Palumbo, Florian Barbaro, Anna Dyka, Lohith Rachakonda
Published on arXiv
2601.20510
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
No single-paradigm detector generalizes across all TTS architectures; LLM-based synthesis exposes systematic vulnerabilities in semantic detectors, while multi-view integration achieves near-perfect separation across all attack vectors.
Recent advances in Text-to-Speech (TTS) systems have substantially increased the realism of synthetic speech, raising new challenges for audio deepfake detection. This work presents a comparative evaluation of three state-of-the-art TTS models: Dia2, Maya1, and MeloTTS, representing streaming, LLM-based, and non-autoregressive architectures, respectively. A corpus of 12,000 synthetic audio samples was generated from the DailyDialog dataset and evaluated against four detection frameworks spanning semantic, structural, and signal-level approaches. The results reveal significant variability in detector performance across generative mechanisms: models effective against one TTS architecture may fail against others, particularly LLM-based synthesis. In contrast, a multi-view detection approach combining complementary analysis levels demonstrates robust performance across all evaluated models. These findings highlight the limitations of single-paradigm detectors and emphasize the necessity of integrated detection strategies to address the evolving landscape of audio deepfake threats.
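The comparative setup in the abstract amounts to a detector-by-generator evaluation grid. A minimal sketch of that layout follows; the TTS names come from the paper, but the stub detector, the artifact scores, and the resulting accuracies are fabricated placeholders, not the paper's reported results.

```python
from statistics import mean

def accuracy(predictions, labels):
    # Fraction of samples where the detector's verdict matches ground truth.
    return mean(int(p == y) for p, y in zip(predictions, labels))

# Hypothetical stub detector: flags a clip as synthetic when a
# (placeholder) spectral-artifact score exceeds a threshold.
def signal_detector(artifact_score, threshold=0.5):
    return artifact_score > threshold

# Toy per-architecture (scores, ground-truth labels) pairs, invented
# for illustration only.
samples = {
    "Dia2":    ([0.9, 0.7, 0.2], [True, True, False]),
    "Maya1":   ([0.4, 0.3, 0.1], [True, True, False]),
    "MeloTTS": ([0.8, 0.6, 0.3], [True, True, False]),
}

# One accuracy figure per TTS architecture; reporting per-generator
# rather than pooled accuracy is what makes generalization gaps visible.
grid = {
    tts: accuracy([signal_detector(s) for s in scores], labels)
    for tts, (scores, labels) in samples.items()
}
```

In this toy grid the stub detector scores perfectly on two architectures but misses the synthetic Maya1 clips, mirroring (in fabricated numbers) the kind of per-generator failure mode the paper measures.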
Key Contributions
- Novel evaluation corpus of 12,000 synthetic audio samples spanning three modern TTS paradigms (Dia2/streaming, Maya1/LLM-based, MeloTTS/non-autoregressive) derived from the DailyDialog dataset
- First empirical characterization showing semantic detectors (Whisper-MesoNet) are systematically vulnerable to LLM-based TTS, while hierarchical fusion models fail against flow-matching synthesis
- Demonstrates that multi-view detection combining semantic, structural, and signal-level analysis achieves robust performance across all evaluated TTS architectures where single-paradigm detectors fail
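The multi-view idea in the last contribution can be pictured as score-level fusion across complementary views. The sketch below is an illustrative assumption, not the paper's actual architecture: the view names, equal weights, and 0.5 threshold are all placeholders.

```python
from dataclasses import dataclass

@dataclass
class ViewScores:
    """Per-view synthetic-likelihood scores in [0, 1] (hypothetical views)."""
    semantic: float    # e.g. transcript-consistency cues
    structural: float  # e.g. prosodic/temporal structure cues
    signal: float      # e.g. low-level spectral artifacts

def fuse(scores: ViewScores, weights=(1/3, 1/3, 1/3)) -> float:
    # Weighted average of the per-view scores (equal weights assumed here).
    views = (scores.semantic, scores.structural, scores.signal)
    return sum(w * v for w, v in zip(weights, views))

def is_synthetic(scores: ViewScores, threshold: float = 0.5) -> bool:
    return fuse(scores) >= threshold

# A clip whose semantic view is fooled (as with LLM-based TTS) can still
# be caught when structural and signal views vote otherwise:
sample = ViewScores(semantic=0.1, structural=0.6, signal=0.9)
print(is_synthetic(sample))  # fused score is about 0.53, so True
```

The point of the sketch is the failure-isolation property: a single fooled view cannot flip the fused verdict on its own, which is the intuition behind the paper's finding that multi-view integration stays robust where single-paradigm detectors break.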
🛡️ Threat Analysis
Evaluates detection of AI-generated audio content (audio deepfakes) — directly addresses output integrity by assessing whether existing detectors can reliably distinguish human speech from synthetic speech. Constructs a novel corpus of AI-generated audio and characterizes detection failure modes against new TTS paradigms.