Investigating the Impact of Speech Enhancement on Audio Deepfake Detection in Noisy Environments

Logical Access (LA) attacks, also known as audio deepfake attacks, use Text-to-Speech (TTS) or Voice Conversion (VC) methods to generate spoofed speech data. They pose a serious threat to Automatic Speaker Verification (ASV) systems, as intruders can use such attacks to bypass voice biometric security. In this study, we investigate the correlation between speech quality and the performance of audio spoofing detection systems (i.e., the LA task). To that end, two enhancement algorithms are evaluated on two perceptual speech quality measures, namely Perceptual Evaluation of Speech Quality (PESQ) and Speech-to-Reverberation Modulation Ratio (SRMR), and with respect to their impact on the audio spoofing detection system. We adopted the LA dataset provided in the ASVspoof 2019 Challenge and corrupted its test set with noise at different Signal-to-Noise Ratio (SNR) levels, while leaving the training data untouched. Enhancement was applied to attenuate the detrimental effects of noisy speech, and the performances of two models, Speech Enhancement Generative Adversarial Network (SEGAN) and Metric-Optimized Generative Adversarial Network Plus (MetricGAN+), were compared. Although we expect speech quality to correlate well with the performance of speech applications, enhancement can also have side effects on downstream tasks if unwanted artifacts are introduced or relevant information is removed from the speech signal. Our results corroborate this hypothesis: the enhancement algorithm leading to the highest speech quality scores, MetricGAN+, provided the highest Equal Error Rate (EER) on the audio spoofing detection task, whereas the enhancement method with the lowest speech quality scores, SEGAN, led to the lowest EER, and thus to better performance on the LA task.
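The abstract relies on two quantities, a target SNR for corrupting the test set and the EER of the detector. The sketch below illustrates both in NumPy. It is not the authors' pipeline: the noise sources, the mixing procedure, and the scoring convention used in the paper are not stated here, and the function names `mix_at_snr`, `measured_snr_db`, and `equal_error_rate` are hypothetical. It assumes additive noise and detector scores where higher means "more likely bona fide".

```python
import numpy as np


def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean`, scaled so the mixture has the requested SNR (dB).

    Assumption: simple additive mixing; the noise is repeated or truncated
    to the length of the clean signal.
    """
    noise = np.resize(noise, clean.shape)
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # SNR_dB = 10 * log10(P_clean / (scale^2 * P_noise)), solved for scale.
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise


def measured_snr_db(clean, noisy):
    """SNR in dB of `noisy`, treating (noisy - clean) as the noise component."""
    residual = noisy - clean
    return 10 * np.log10(np.mean(clean ** 2) / np.mean(residual ** 2))


def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER from two score arrays (higher score = more bona fide).

    Sweeps every observed score as a threshold and returns the mean of the
    false acceptance and false rejection rates where they are closest.
    """
    bonafide_scores = np.asarray(bonafide_scores, dtype=float)
    spoof_scores = np.asarray(spoof_scores, dtype=float)
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    # False rejection: bona fide scored below the threshold.
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])
    # False acceptance: spoof scored at or above the threshold.
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)  # 1 s test tone
    noise = rng.standard_normal(4000)
    noisy = mix_at_snr(clean, noise, snr_db=10)
    print(f"measured SNR: {measured_snr_db(clean, noisy):.2f} dB")
    print("EER:", equal_error_rate([0.4, 0.9], [0.1, 0.6]))
```

A lower EER means the detector separates bona fide from spoofed speech more cleanly, which is the sense in which SEGAN is reported to outperform MetricGAN+ on the LA task.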

Key Contributions

Evaluates impact of speech enhancement algorithms (SEGAN, MetricGAN+) on deepfake detection under noisy conditions
Finds counter-intuitive result: lower speech quality (SEGAN) yields better detection performance than higher quality (MetricGAN+)
Demonstrates that enhancement can introduce artifacts or remove discriminative features critical for spoofing detection

🛡️ Threat Analysis

Output Integrity Attack

Paper focuses on detecting audio deepfakes (synthetic speech generated by TTS/VC systems) to verify content authenticity. The core contribution is evaluating how preprocessing (enhancement) affects deepfake detection performance — this is output integrity verification (detecting AI-generated audio content).

Details

Domains

audio

Model Types

gan

Threat Tags

inference_time

Datasets

ASVspoof 2019 LA

Applications

2026 0 cit.

Similar Papers

On Deepfake Voice Detection -- It's All in the Presentation

What Counts as Real? Speech Restoration and Voice Quality Conversion Pose New Challenges to Deepfake Detection

Melody or Machine: Detecting Synthetic Music with Dual-Stream Contrastive Learning

Toward Noise-Aware Audio Deepfake Detection: Survey, SNR-Benchmarks, and Practical Recipes

Multilingual Source Tracing of Speech Deepfakes: A First Benchmark

The Impact of Audio Watermarking on Audio Anti-Spoofing Countermeasures

AUDETER: A Large-scale Dataset for Deepfake Audio Detection in Open Worlds

Do Compact SSL Backbones Matter for Audio Deepfake Detection? A Controlled Study with RAPTOR