
Are Modern Speech Enhancement Systems Vulnerable to Adversarial Attacks?

Rostislav Makarov 1, Lea Schönherr 2, Timo Gerkmann 1

0 citations · 15 references · arXiv


Published on arXiv (2509.21087)

Input Manipulation Attack (OWASP ML Top 10 — ML01)

Key Finding

Predictive speech enhancement models are vulnerable to targeted adversarial attacks that alter output semantics, while diffusion models using stochastic samplers exhibit inherent robustness to such attacks by design.


Machine learning approaches for speech enhancement are becoming increasingly expressive, enabling ever more powerful modifications of input signals. In this paper, we demonstrate that this expressiveness introduces a vulnerability: advanced speech enhancement models can be susceptible to adversarial attacks. Specifically, we show that adversarial noise, carefully crafted and psychoacoustically masked by the original input, can be injected such that the enhanced speech output conveys an entirely different semantic meaning. We experimentally verify that contemporary predictive speech enhancement models can indeed be manipulated in this way. Furthermore, we highlight that diffusion models with stochastic samplers exhibit inherent robustness to such adversarial attacks by design.


Key Contributions

  • Attack loss function and optimization procedure incorporating psychoacoustic constraints applicable to both predictive and diffusion-based SE models
  • Comparative vulnerability analysis across regression SE, mask-based SE, and score-based diffusion SE (SGMSE+), revealing that stochastic diffusion samplers confer inherent adversarial robustness
  • Evaluation framework using attack success metrics (DistillMOS, WER, ESTOI, POLQA) and perturbation impact metrics on EARS-WHAM-v2

🛡️ Threat Analysis

Input Manipulation Attack

Gradient-based adversarial perturbation attack at inference time: psychoacoustically masked adversarial noise is injected into speech signals so that SE model outputs convey a completely different semantic meaning than intended — a targeted evasion/manipulation attack against predictive and generative SE models.
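The attack described above optimizes a perturbation by gradient descent through the SE model while keeping it below a psychoacoustic masking threshold. The following is a minimal toy sketch of that idea, not the paper's actual method: a near-identity linear filter stands in for the SE network (so the gradient is analytic rather than backpropagated), a crude per-sample amplitude bound stands in for a real psychoacoustic masking model, and all names (`se_model`, `mask`, `target`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a deterministic predictive SE model: a fixed
# near-identity linear filter (real SE models are deep networks).
W = np.eye(64) + 0.05 * rng.standard_normal((64, 64))

def se_model(x):
    return W @ x

x = rng.standard_normal(64)        # "noisy speech" input (toy signal)
target = rng.standard_normal(64)   # attacker's desired enhanced output

# Crude per-sample stand-in for a psychoacoustic masking threshold:
# the perturbation must stay small relative to the local signal level.
mask = 0.05 * np.abs(x) + 1e-3

delta = np.zeros_like(x)
lr = 0.05
for _ in range(1000):
    residual = se_model(x + delta) - target
    grad = 2.0 * (W.T @ residual)        # analytic gradient of the L2 loss
    delta -= lr * grad                   # targeted gradient step
    delta = np.clip(delta, -mask, mask)  # project inside the "mask"

base_loss = np.sum((se_model(x) - target) ** 2)
atk_loss = np.sum((se_model(x + delta) - target) ** 2)
```

Even under the tight masking constraint, the projected-gradient loop steers the model output measurably toward the attacker's target (`atk_loss < base_loss`), which is the mechanism the paper exploits against predictive models; against a stochastic diffusion sampler, fresh noise in each forward pass would break this fixed-perturbation optimization.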


Details

Domains
audio
Model Types
diffusion, cnn
Threat Tags
white_box, inference_time, targeted, digital
Datasets
EARS-WHAM-v2
Applications
speech enhancement, hearing aids, telephony systems