
Are Modern Speech Enhancement Systems Vulnerable to Adversarial Attacks?

Rostislav Makarov 1, Lea Schönherr 2, Timo Gerkmann 1

0 citations · 15 references · arXiv


Published on arXiv (2509.21087)

Input Manipulation Attack (OWASP ML Top 10 — ML01)

Key Finding

Predictive speech enhancement models are vulnerable to targeted adversarial attacks that alter output semantics, while diffusion models using stochastic samplers exhibit inherent robustness to such attacks by design.


Machine learning approaches for speech enhancement are becoming increasingly expressive, enabling ever more powerful modifications of input signals. In this paper, we demonstrate that this expressiveness introduces a vulnerability: advanced speech enhancement models can be susceptible to adversarial attacks. Specifically, we show that adversarial noise, carefully crafted and psychoacoustically masked by the original input, can be injected such that the enhanced speech output conveys an entirely different semantic meaning. We experimentally verify that contemporary predictive speech enhancement models can indeed be manipulated in this way. Furthermore, we highlight that diffusion models with stochastic samplers exhibit inherent robustness to such adversarial attacks by design.


Key Contributions

  • Attack loss function and optimization procedure incorporating psychoacoustic constraints applicable to both predictive and diffusion-based SE models
  • Comparative vulnerability analysis across regression SE, mask-based SE, and score-based diffusion SE (SGMSE+), revealing that stochastic diffusion samplers confer inherent adversarial robustness
  • Evaluation framework using attack success metrics (DistillMOS, WER, ESTOI, POLQA) and perturbation impact metrics on EARS-WHAM-v2

🛡️ Threat Analysis

Input Manipulation Attack

Gradient-based adversarial perturbation attack at inference time: psychoacoustically masked adversarial noise is injected into speech signals so that SE model outputs convey a completely different semantic meaning than intended — a targeted evasion/manipulation attack against predictive and generative SE models.
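The attack described above optimizes a perturbation by gradient descent through the SE model while keeping it below a psychoacoustic masking threshold. The following is a minimal toy sketch of that idea, not the paper's actual method: a near-identity linear filter stands in for the SE network (so the gradient is analytic rather than backpropagated), a crude per-sample amplitude bound stands in for a real psychoacoustic masking model, and all names (`se_model`, `mask`, `target`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a deterministic predictive SE model: a fixed
# near-identity linear filter (real SE models are deep networks).
W = np.eye(64) + 0.05 * rng.standard_normal((64, 64))

def se_model(x):
    return W @ x

x = rng.standard_normal(64)        # "noisy speech" input (toy signal)
target = rng.standard_normal(64)   # attacker's desired enhanced output

# Crude per-sample stand-in for a psychoacoustic masking threshold:
# the perturbation must stay small relative to the local signal level.
mask = 0.05 * np.abs(x) + 1e-3

delta = np.zeros_like(x)
lr = 0.05
for _ in range(1000):
    residual = se_model(x + delta) - target
    grad = 2.0 * (W.T @ residual)        # analytic gradient of the L2 loss
    delta -= lr * grad                   # targeted gradient step
    delta = np.clip(delta, -mask, mask)  # project inside the "mask"

base_loss = np.sum((se_model(x) - target) ** 2)
atk_loss = np.sum((se_model(x + delta) - target) ** 2)
```

Even under the tight masking constraint, the projected-gradient loop steers the model output measurably toward the attacker's target (`atk_loss < base_loss`), which is the mechanism the paper exploits against predictive models; against a stochastic diffusion sampler, fresh noise in each forward pass would break this fixed-perturbation optimization.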


Details

Domains
audio
Model Types
diffusion, cnn
Threat Tags
white_box, inference_time, targeted, digital
Datasets
EARS-WHAM-v2
Applications
speech enhancement, hearing aids, telephony systems