benchmark 2026

Investigating the Impact of Speech Enhancement on Audio Deepfake Detection in Noisy Environments

Anacin 1, Angela , Shruti Kshirsagar 1, Anderson R. Avila 2

0 citations

Published on arXiv

2603.14767

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Enhancement method with lowest speech quality scores (SEGAN) achieved lowest EER on deepfake detection, outperforming higher-quality MetricGAN+


Logical Access (LA) attacks, also known as audio deepfake attacks, use Text-to-Speech (TTS) or Voice Conversion (VC) methods to generate spoofed speech data. This can represent a serious threat to Automatic Speaker Verification (ASV) systems, as intruders can use such attacks to bypass voice biometric security. In this study, we investigate the correlation between speech quality and the performance of audio spoofing detection systems (i.e., the LA task). To that end, two enhancement algorithms are evaluated with two perceptual speech quality measures, namely Perceptual Evaluation of Speech Quality (PESQ) and Speech-to-Reverberation Modulation Energy Ratio (SRMR), and with respect to their impact on the audio spoofing detection system. We adopted the LA dataset provided in the ASVspoof 2019 Challenge and corrupted its test set with noise at different Signal-to-Noise Ratio (SNR) levels, while leaving the training data untouched. Enhancement was applied to attenuate the detrimental effects of noisy speech, and the performances of two models, Speech Enhancement Generative Adversarial Network (SEGAN) and Metric-Optimized Generative Adversarial Network Plus (MetricGAN+), were compared. Although speech quality is expected to correlate well with the performance of speech applications, enhancement can also have side effects on downstream tasks if unwanted artifacts are introduced or relevant information is removed from the speech signal. Our results corroborate this hypothesis: the enhancement algorithm leading to the highest speech quality scores, MetricGAN+, yielded a higher Equal Error Rate (EER) on the audio spoofing detection task, whereas the enhancement method with the lowest speech quality scores, SEGAN, led to the lowest EER and thus to better performance on the LA task.
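
The test-set corruption described in the abstract can be illustrated with a minimal sketch. It assumes additive noise scaled to a target SNR; the paper's noise types, SNR levels, and tooling are not given here, so the function names `mix_at_snr` and `measured_snr_db`, the sine-wave stand-in for speech, and the white-noise source are all hypothetical.

```python
import numpy as np


def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean` after scaling it to reach the target SNR in dB."""
    clean = np.asarray(clean, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    # Tile or trim the noise so it covers the whole clean signal.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose gain g so that clean_power / (g**2 * noise_power) == 10**(snr_db / 10).
    gain = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + gain * noise


def measured_snr_db(clean, noisy):
    """Recover the SNR in dB from a clean signal and its noisy version."""
    residual = np.asarray(noisy) - np.asarray(clean)
    return 10 * np.log10(np.mean(np.square(clean)) / np.mean(np.square(residual)))


# Illustrative data only: a 1-second, 16 kHz tone standing in for an utterance.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
noise = rng.standard_normal(16000)
noisy = mix_at_snr(clean, noise, snr_db=10)
```

Only the evaluation audio would be passed through such a function in the setup described above, since the training data is left untouched.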


Key Contributions

  • Evaluates impact of speech enhancement algorithms (SEGAN, MetricGAN+) on deepfake detection under noisy conditions
  • Finds counter-intuitive result: lower speech quality (SEGAN) yields better detection performance than higher quality (MetricGAN+)
  • Demonstrates that enhancement can introduce artifacts or remove discriminative features critical for spoofing detection
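
The detection performance compared above is reported as EER, the operating point where the false acceptance rate and the false rejection rate are equal. The following is a minimal sketch of one common way to estimate it from detector scores, not the paper's evaluation code. It assumes that higher scores mean "more likely bona fide"; the function name `compute_eer` and the toy scores are hypothetical.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold), assuming higher scores indicate bona fide speech."""
    bonafide = np.asarray(bonafide_scores, dtype=np.float64)
    spoof = np.asarray(spoof_scores, dtype=np.float64)
    # Every observed score is a candidate decision threshold.
    thresholds = np.unique(np.concatenate([bonafide, spoof]))
    # False rejection rate: bona fide trials scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance rate: spoofed trials scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])
    # Pick the threshold where the two error rates are closest.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2, thresholds[idx]


# Toy scores for illustration only; they are not results from the paper.
eer, threshold = compute_eer([0.9, 0.6, 0.4, 0.8], [0.1, 0.5, 0.7, 0.2])
```

A lower EER means fewer errors at that balanced operating point, which is why the lower EER obtained with SEGAN indicates better spoofing detection.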

🛡️ Threat Analysis

Output Integrity Attack

Paper focuses on detecting audio deepfakes (synthetic speech generated by TTS/VC systems) to verify content authenticity. The core contribution is evaluating how preprocessing (enhancement) affects deepfake detection performance — this is output integrity verification (detecting AI-generated audio content).


Details

Domains
audio
Model Types
gan
Threat Tags
inference_time
Datasets
ASVspoof 2019 LA
Applications
speaker verification
audio deepfake detection