Audio Deepfake Detection in the Age of Advanced Text-to-Speech Models
Robin Singh, Aditya Yogesh Nair, Fabio Palumbo, Florian Barbaro, Anna Dyka, Lohith Rachakonda
Published on arXiv
2601.20510
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
No single-paradigm detector generalizes across all TTS architectures; LLM-based synthesis exposes systematic vulnerabilities in semantic detectors, while multi-view integration achieves near-perfect separation across all attack vectors.
Recent advances in Text-to-Speech (TTS) systems have substantially increased the realism of synthetic speech, raising new challenges for audio deepfake detection. This work presents a comparative evaluation of three state-of-the-art TTS models: Dia2, Maya1, and MeloTTS, representing streaming, LLM-based, and non-autoregressive architectures, respectively. A corpus of 12,000 synthetic audio samples was generated from the DailyDialog dataset and evaluated against four detection frameworks spanning semantic, structural, and signal-level approaches. The results reveal significant variability in detector performance across generative mechanisms: models effective against one TTS architecture may fail against others, particularly LLM-based synthesis. In contrast, a multi-view detection approach combining complementary analysis levels demonstrates robust performance across all evaluated models. These findings highlight the limitations of single-paradigm detectors and emphasize the necessity of integrated detection strategies to address the evolving landscape of audio deepfake threats.
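The comparative setup in the abstract amounts to a detector-by-generator evaluation grid. A minimal sketch of that layout follows; the TTS names come from the paper, but the stub detector, the artifact scores, and the resulting accuracies are fabricated placeholders, not the paper's reported results.

```python
from statistics import mean

def accuracy(predictions, labels):
    # Fraction of samples where the detector's verdict matches ground truth.
    return mean(int(p == y) for p, y in zip(predictions, labels))

# Hypothetical stub detector: flags a clip as synthetic when a
# (placeholder) spectral-artifact score exceeds a threshold.
def signal_detector(artifact_score, threshold=0.5):
    return artifact_score > threshold

# Toy per-architecture (scores, ground-truth labels) pairs, invented
# for illustration only.
samples = {
    "Dia2":    ([0.9, 0.7, 0.2], [True, True, False]),
    "Maya1":   ([0.4, 0.3, 0.1], [True, True, False]),
    "MeloTTS": ([0.8, 0.6, 0.3], [True, True, False]),
}

# One accuracy figure per TTS architecture; reporting per-generator
# rather than pooled accuracy is what makes generalization gaps visible.
grid = {
    tts: accuracy([signal_detector(s) for s in scores], labels)
    for tts, (scores, labels) in samples.items()
}
```

In this toy grid the stub detector scores perfectly on two architectures but misses the synthetic Maya1 clips, mirroring (in fabricated numbers) the kind of per-generator failure mode the paper measures.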
Key Contributions
- Novel evaluation corpus of 12,000 synthetic audio samples spanning three modern TTS paradigms (Dia2/streaming, Maya1/LLM-based, MeloTTS/non-autoregressive) derived from the DailyDialog dataset
- First empirical characterization showing semantic detectors (Whisper-MesoNet) are systematically vulnerable to LLM-based TTS, while hierarchical fusion models fail against flow-matching synthesis
- Demonstrates that multi-view detection combining semantic, structural, and signal-level analysis achieves robust performance across all evaluated TTS architectures where single-paradigm detectors fail
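The multi-view idea in the last contribution can be pictured as score-level fusion across complementary views. The sketch below is an illustrative assumption, not the paper's actual architecture: the view names, equal weights, and 0.5 threshold are all placeholders.

```python
from dataclasses import dataclass

@dataclass
class ViewScores:
    """Per-view synthetic-likelihood scores in [0, 1] (hypothetical views)."""
    semantic: float    # e.g. transcript-consistency cues
    structural: float  # e.g. prosodic/temporal structure cues
    signal: float      # e.g. low-level spectral artifacts

def fuse(scores: ViewScores, weights=(1/3, 1/3, 1/3)) -> float:
    # Weighted average of the per-view scores (equal weights assumed here).
    views = (scores.semantic, scores.structural, scores.signal)
    return sum(w * v for w, v in zip(weights, views))

def is_synthetic(scores: ViewScores, threshold: float = 0.5) -> bool:
    return fuse(scores) >= threshold

# A clip whose semantic view is fooled (as with LLM-based TTS) can still
# be caught when structural and signal views vote otherwise:
sample = ViewScores(semantic=0.1, structural=0.6, signal=0.9)
print(is_synthetic(sample))  # fused score is about 0.53, so True
```

The point of the sketch is the failure-isolation property: a single fooled view cannot flip the fused verdict on its own, which is the intuition behind the paper's finding that multi-view integration stays robust where single-paradigm detectors break.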
🛡️ Threat Analysis
Evaluates detection of AI-generated audio content (audio deepfakes) — directly addresses output integrity by assessing whether existing detectors can reliably distinguish human speech from synthetic speech. Constructs a novel corpus of AI-generated audio and characterizes detection failure modes against new TTS paradigms.