Are Modern Speech Enhancement Systems Vulnerable to Adversarial Attacks?
Rostislav Makarov¹, Lea Schönherr², Timo Gerkmann¹
Published on arXiv (arXiv:2509.21087)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
Predictive speech enhancement models are vulnerable to targeted adversarial attacks that alter output semantics, while diffusion models using stochastic samplers exhibit inherent robustness to such attacks by design.
Machine learning approaches for speech enhancement are becoming increasingly expressive, enabling ever more powerful modifications of input signals. In this paper, we demonstrate that this expressiveness introduces a vulnerability: advanced speech enhancement models can be susceptible to adversarial attacks. Specifically, we show that adversarial noise, carefully crafted and psychoacoustically masked by the original input, can be injected such that the enhanced speech output conveys an entirely different semantic meaning. We experimentally verify that contemporary predictive speech enhancement models can indeed be manipulated in this way. Furthermore, we highlight that diffusion models with stochastic samplers exhibit inherent robustness to such adversarial attacks by design.
Key Contributions
- Attack loss function and optimization procedure incorporating psychoacoustic constraints, applicable to both predictive and diffusion-based speech enhancement (SE) models
- Comparative vulnerability analysis across regression SE, mask-based SE, and score-based diffusion SE (SGMSE+), revealing that stochastic diffusion samplers confer inherent adversarial robustness
- Evaluation framework using attack success metrics (DistillMOS, WER, ESTOI, POLQA) and perturbation impact metrics on EARS-WHAM-v2
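The attack idea above can be sketched as projected gradient descent toward an attacker-chosen target output, with the perturbation kept inside a per-sample budget standing in for the psychoacoustic masking threshold. This is a minimal toy illustration, not the paper's implementation: the linear `se_model`, the budget construction, and all constants are assumptions, whereas the paper attacks real neural SE models via automatic differentiation and a proper masking model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a predictive SE model: a fixed linear operator
# (hypothetical; real attacks backpropagate through a neural network).
n = 64
W = rng.standard_normal((n, n)) / np.sqrt(n)
se_model = lambda x: W @ x

x = rng.standard_normal(n)          # noisy input speech (toy signal)
y_target = rng.standard_normal(n)   # attacker's desired output (toy)

# Per-sample perturbation budget standing in for a psychoacoustic
# masking threshold derived from the original input.
mask = 0.05 * np.abs(x) + 1e-3

delta = np.zeros(n)
lr = 0.1
for _ in range(200):
    residual = se_model(x + delta) - y_target
    grad = 2 * W.T @ residual            # analytic gradient of ||.||^2
    delta -= lr * grad
    delta = np.clip(delta, -mask, mask)  # project onto the masking budget

loss_clean = np.sum((se_model(x) - y_target) ** 2)
loss_adv = np.sum((se_model(x + delta) - y_target) ** 2)
print(f"distance to target: clean {loss_clean:.2f}, attacked {loss_adv:.2f}")
```

Even under the tight masking budget, the optimized perturbation moves the model output measurably toward the target, which is the core mechanism the paper exploits at much larger scale.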
🛡️ Threat Analysis
Gradient-based adversarial perturbation attack at inference time: psychoacoustically masked adversarial noise is injected into the input speech signal so that the SE model's output conveys an entirely different semantic meaning than the original utterance. This constitutes a targeted evasion/manipulation attack against both predictive and generative SE models.
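The intuition behind the diffusion-model finding can be illustrated with a toy stochastic "enhancer": because a stochastic sampler injects fresh noise on every run, a fixed adversarial input cannot deterministically pin the output to the attacker's target. Everything below is an assumption for illustration (`stochastic_se`, `sigma`, the linear operator); the paper's SGMSE+ sampler is a full reverse diffusion process, not this one-step model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stochastic enhancer: deterministic part plus fresh sampling noise
# (hypothetical; stands in for a diffusion model's stochastic sampler).
n = 64
W = rng.standard_normal((n, n)) / np.sqrt(n)
sigma = 0.5

def stochastic_se(x, rng):
    return W @ x + sigma * rng.standard_normal(x.shape)

x_adv = rng.standard_normal(n)   # adversarially perturbed input (toy)
y_target = W @ x_adv             # best case the attacker could hope for

# Across repeated runs the output never lands exactly on the target:
# the sampler's own noise floor persists no matter how delta was tuned.
dists = [np.linalg.norm(stochastic_se(x_adv, rng) - y_target)
         for _ in range(100)]
print(f"min/mean distance to target over 100 runs: "
      f"{min(dists):.2f} / {np.mean(dists):.2f}")
```

The residual distance scales with the sampler's noise level, which is one way to read the paper's claim that stochastic samplers confer robustness "by design" rather than through any defensive training.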