
Toward Noise-Aware Audio Deepfake Detection: Survey, SNR-Benchmarks, and Practical Recipes

Udayon Sen 1, Alka Luqman 1,2, Anupam Chattopadhyay 1

0 citations · 22 references · arXiv


Published on arXiv (2512.13744)

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Finetuning pretrained speech encoders (WavLM, Wav2Vec2, MMS) under multi-condition noise training reduces EER by approximately 10–15 percentage points at 10–0 dB SNR compared to frozen baselines.


Deepfake audio detection has progressed rapidly with strong pre-trained encoders (e.g., WavLM, Wav2Vec2, MMS). However, performance under realistic capture conditions (background noise in domestic, office, and transport settings; room reverberation; consumer channels) often lags clean-lab results. We survey and evaluate the robustness of state-of-the-art audio deepfake detection models and present a reproducible framework that mixes MS-SNSD noises with ASVspoof 2021 DF utterances for evaluation under controlled signal-to-noise ratios (SNRs). SNR is a measured proxy for noise severity used widely in speech processing; it lets us sweep from near-clean (35 dB) to very noisy (−5 dB) conditions to quantify graceful degradation. We study multi-condition training and fixed-SNR testing for pretrained encoders (WavLM, Wav2Vec2, MMS), reporting accuracy, ROC-AUC, and EER on binary and four-class (authenticity × corruption) tasks. In our experiments, finetuning reduces EER by 10–15 percentage points at 10–0 dB SNR across backbones.
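The core operation of such an SNR-controlled framework is mixing a noise clip into an utterance so that the speech-to-noise power ratio hits a chosen target. A minimal NumPy sketch of that idea (the function name and details are illustrative, not the authors' code):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech`, scaled so the resulting SNR equals `snr_db`."""
    # Loop the noise if it is shorter than the utterance, then trim to length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # guard against all-zero noise
    # Solve 10*log10(speech_power / (gain^2 * noise_power)) = snr_db for gain.
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise
```

Sweeping `snr_db` from 35 down to −5 with fixed speech and noise pools then yields the controlled degradation curve the paper evaluates on.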


Key Contributions

  • Survey of noise robustness in state-of-the-art audio deepfake detection models, situating the problem within broader speech processing research
  • Reproducible SNR-controlled evaluation framework mixing MS-SNSD ambient noises with ASVspoof 2021 DF utterances across a wide SNR range (−5 dB to 35 dB)
  • Empirical comparison of frozen vs. finetuned WavLM, Wav2Vec2, and MMS encoders on binary and four-class detection tasks, with finetuning reducing EER by 10–15 percentage points at 10–0 dB SNR
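The headline metric above, EER, is the operating point where the false-acceptance rate equals the false-rejection rate. A minimal sketch of computing it from detector scores via a threshold sweep (function name and score convention are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def compute_eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER from scores (higher = more likely bonafide) and binary labels (1 = bonafide)."""
    order = np.argsort(scores)
    labels = labels[order]
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    # Threshold at each sorted score: FRR rises, FAR falls as the threshold grows.
    frr = np.cumsum(labels) / n_pos            # bonafide rejected at/below threshold
    far = 1 - np.cumsum(1 - labels) / n_neg    # spoofs accepted above threshold
    idx = np.argmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2)
```

A drop of 10–15 percentage points in this quantity at 10–0 dB SNR is the paper's reported gain from finetuning over frozen encoders.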

🛡️ Threat Analysis

Output Integrity Attack

Audio deepfake detection is AI-generated content detection; the paper surveys state-of-the-art detectors, introduces a reproducible SNR-controlled evaluation framework, and proposes multi-condition training to improve detection of synthetic speech — all squarely within output integrity and content authenticity.


Details

Domains
audio
Model Types
transformer
Threat Tags
inference_time
Datasets
ASVspoof 2021 DF, MS-SNSD
Applications
audio deepfake detection, speaker verification, anti-spoofing