Audio Deepfake Detection in the Age of Advanced Text-to-Speech Models

Robin Singh, Aditya Yogesh Nair, Fabio Palumbo, Florian Barbaro, Anna Dyka, Lohith Rachakonda


Published on arXiv: 2601.20510

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

No single-paradigm detector generalizes across all TTS architectures; LLM-based synthesis exposes systematic vulnerabilities in semantic detectors, while multi-view integration achieves near-perfect separation across all attack vectors.


Recent advances in Text-to-Speech (TTS) systems have substantially increased the realism of synthetic speech, raising new challenges for audio deepfake detection. This work presents a comparative evaluation of three state-of-the-art TTS models (Dia2, Maya1, and MeloTTS), representing streaming, LLM-based, and non-autoregressive architectures respectively. A corpus of 12,000 synthetic audio samples was generated from the DailyDialog dataset and evaluated against four detection frameworks spanning semantic, structural, and signal-level approaches. The results reveal significant variability in detector performance across generative mechanisms: models effective against one TTS architecture may fail against others, particularly LLM-based synthesis. In contrast, a multi-view detection approach that combines complementary analysis levels demonstrates robust performance across all evaluated models. These findings highlight the limitations of single-paradigm detectors and underscore the need for integrated detection strategies to address the evolving landscape of audio deepfake threats.


Key Contributions

  • Novel evaluation corpus of 12,000 synthetic audio samples spanning three modern TTS paradigms (Dia2/streaming, Maya1/LLM-based, MeloTTS/non-autoregressive) derived from the DailyDialog dataset
  • First empirical characterization showing semantic detectors (Whisper-MesoNet) are systematically vulnerable to LLM-based TTS, while hierarchical fusion models fail against flow-matching synthesis
  • Demonstrates that multi-view detection combining semantic, structural, and signal-level analysis achieves robust performance across all evaluated TTS architectures where single-paradigm detectors fail
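The multi-view idea above can be sketched as simple score-level fusion. The paper does not publish its fusion rule, so the function names, equal weights, and 0.5 threshold below are illustrative assumptions, not the authors' implementation:

```python
# Hedged sketch: fuse per-view deepfake scores into one decision.
# Each view's detector (semantic, structural, signal-level) is assumed to
# emit a score in [0, 1], where higher means "more likely synthetic".

def fuse_scores(semantic: float, structural: float, signal: float,
                weights: tuple = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Weighted average of the three per-view scores (assumed fusion rule)."""
    views = (semantic, structural, signal)
    return sum(w * s for w, s in zip(weights, views))

def is_synthetic(semantic: float, structural: float, signal: float,
                 threshold: float = 0.5) -> bool:
    """Flag audio as synthetic when the fused score crosses the threshold."""
    return fuse_scores(semantic, structural, signal) >= threshold
```

The design point this illustrates: even if one view is fooled (e.g., a semantic detector scoring LLM-based speech as human), the other views can still push the fused score over the threshold.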

🛡️ Threat Analysis

Output Integrity Attack

Evaluates AI-generated audio content detection (audio deepfakes) — directly addresses output integrity by assessing whether existing detectors can authenticate whether speech is human or synthetic. Constructs a novel corpus of AI-generated audio and characterizes detection failure modes against new TTS paradigms.


Details

Domains
audio, generative, nlp
Model Types
transformer, llm
Threat Tags
inference_time, black_box
Datasets
DailyDialog, UncovAI TTS corpus
Applications
audio deepfake detection, speaker verification security, synthetic speech detection