An Effective Energy Mask-based Adversarial Evasion Attacks against Misclassification in Speaker Recognition Systems

Evasion attacks pose significant threats to AI systems, exploiting vulnerabilities in machine learning models to bypass detection mechanisms. The widespread use of voice data, including deepfakes, in promising future industries is currently hindered by insufficient legal frameworks. Adversarial attack methods have emerged as the most effective countermeasure against the indiscriminate use of such data. This research introduces masked energy perturbation (MEP), a novel approach that uses the power spectrum to apply energy masking to the original voice data. MEP masks low-energy regions in the frequency domain before generating adversarial perturbations, targeting areas less noticeable to the human auditory model. The study primarily employs advanced speaker recognition models, including ECAPA-TDNN and ResNet34, which have shown remarkable performance in speaker verification tasks. The proposed MEP method demonstrated strong performance in both audio quality and evasion effectiveness. The energy masking approach minimizes degradation in perceptual evaluation of speech quality (PESQ) scores, indicating that the adversarial perturbations cause minimal perceptual distortion for human listeners. Specifically, in the PESQ evaluation, MEP achieved a 26.68% relative improvement over the fast gradient sign method (FGSM) and iterative FGSM.
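The energy-masking step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the frame length, hop size, the quantile-based threshold, and the `keep_ratio` parameter are all assumptions chosen for illustration, since the summary only states that the mask comes from the power spectrum.

```python
import numpy as np

def power_spectrum(signal, frame_len=512, hop=256):
    """STFT power spectrum, shape (n_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

def energy_mask(power, keep_ratio=0.2):
    """Binary mask: 1 for the highest-energy bins, 0 for low-energy bins.

    The quantile threshold and keep_ratio are hypothetical; the source
    does not specify how the threshold is chosen.
    """
    threshold = np.quantile(power, 1.0 - keep_ratio)
    return (power >= threshold).astype(np.float64)

# Synthetic stand-in for a voice recording: two harmonics plus faint noise.
rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
voice_like = (np.sin(2 * np.pi * 220 * t)
              + 0.5 * np.sin(2 * np.pi * 440 * t)
              + 0.01 * rng.standard_normal(sr))

power = power_spectrum(voice_like)
mask = energy_mask(power, keep_ratio=0.2)
```

With this mask, low-energy time-frequency bins are zeroed and only the retained bins remain available to a later perturbation step.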

Key Contributions

Masked Energy Perturbation (MEP): applies STFT-based energy masking to focus adversarial perturbations on high-energy, perceptually prominent frequency bins while suppressing low-energy regions using a threshold derived from the power spectrum
Achieves 20% higher attack success rates than conventional FGSM-based methods while maintaining PESQ scores above 3.5, with a 26.68% relative PESQ improvement over FGSM
Demonstrates the attack against state-of-the-art speaker verification models (ECAPA-TDNN and ResNet34) on LibriSpeech, showing robustness across noise environments and speaker variations
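The two headline metrics above, relative PESQ improvement and attack success rate, can be computed as in the sketch below. The definitions are common conventions assumed here, not confirmed by the source, and the function names and input values are hypothetical placeholders rather than results from the paper.

```python
def relative_improvement(baseline, proposed):
    """Relative change of `proposed` over `baseline`, in percent (assumed definition)."""
    return 100.0 * (proposed - baseline) / baseline

def attack_success_rate(accepted_clean, accepted_adv):
    """Share of trials accepted on clean audio that are rejected after the attack.

    This is one common definition for evasion attacks; the paper may
    define success differently.
    """
    total = sum(1 for c in accepted_clean if c)
    flipped = sum(1 for c, a in zip(accepted_clean, accepted_adv) if c and not a)
    return flipped / total if total else 0.0

# Hypothetical values, for illustration only.
pesq_gain = relative_improvement(baseline=3.0, proposed=3.75)
asr = attack_success_rate(
    accepted_clean=[True, True, False, True],
    accepted_adv=[False, True, False, False],
)
```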

🛡️ Threat Analysis

Input Manipulation Attack

MEP is a gradient-based adversarial perturbation attack targeting speaker recognition models (ECAPA-TDNN, ResNet34) at inference time; it applies frequency-domain energy masking before generating adversarial examples to achieve evasion while minimizing perceptual distortion, directly building on FGSM and iterative FGSM adversarial attack methodology.
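A rough sketch of how FGSM-style steps might be combined with a frequency-domain energy mask follows. It makes several loud assumptions: the "speaker model" is a toy linear logistic scorer with weights `w` (not ECAPA-TDNN or ResNet34), the mask is taken over a single whole-signal FFT, and the perturbation is projected onto the retained bins, which is only one possible reading of "masking before generating adversarial examples".

```python
import numpy as np

def spectral_energy_mask(x, keep_ratio=0.2):
    """1 for the highest-energy FFT bins of x, 0 elsewhere (hypothetical rule)."""
    power = np.abs(np.fft.rfft(x)) ** 2
    return (power >= np.quantile(power, 1.0 - keep_ratio)).astype(np.float64)

def project_to_mask(delta, mask):
    """Zero the perturbation's spectrum outside the retained bins."""
    return np.fft.irfft(np.fft.rfft(delta) * mask, n=len(delta))

def toy_loss(x, w, label):
    """Logistic loss of a toy linear scorer standing in for a speaker model."""
    z = w @ x
    return np.logaddexp(0.0, -z) if label == 1 else np.logaddexp(0.0, z)

def toy_grad(x, w, label):
    """Analytic gradient of toy_loss with respect to the input x."""
    p = 1.0 / (1.0 + np.exp(-(w @ x)))
    return (p - label) * w

def masked_ifgsm(x, w, label, eps, steps, mask):
    """Iterative sign-gradient ascent on the loss; steps=1 reduces to FGSM.

    No time-domain clipping is applied, because clipping would put energy
    back into masked bins; a real attack would have to reconcile both
    constraints.
    """
    alpha = eps / steps
    x_adv = x.copy()
    for _ in range(steps):
        step = alpha * np.sign(toy_grad(x_adv, w, label))
        x_adv = x_adv + project_to_mask(step, mask)
    return x_adv

rng = np.random.default_rng(0)
n = 2048
t = np.arange(n) / 16000
x = np.sin(2 * np.pi * 220 * t) + 0.05 * rng.standard_normal(n)
w = rng.standard_normal(n) / np.sqrt(n)

mask = spectral_energy_mask(x)
x_fgsm = masked_ifgsm(x, w, label=1, eps=0.01, steps=1, mask=mask)
x_ifgsm = masked_ifgsm(x, w, label=1, eps=0.01, steps=10, mask=mask)
```

In this linear toy setting the gradient sign does not change between steps, so the iterative variant lands on the same point as the single step; with a deep model the two would generally differ.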

Details

Domains

audio

Model Types

cnn, transformer

Threat Tags

white_box, inference_time, targeted, digital

Datasets

LibriSpeech

Applications

Similar Papers

Mirage Fools the Ear, Mute Hides the Truth: Precise Targeted Adversarial Attacks on Polyphonic Sound Event Detection Systems

MORE: Multi-Objective Adversarial Attacks on Speech Recognition

DUAP: Dual-task Universal Adversarial Perturbations Against Voice Control Systems

Spectral Masking and Interpolation Attack (SMIA): A Black-box Adversarial Attack against Voice Authentication and Anti-Spoofing Systems

Are Modern Speech Enhancement Systems Vulnerable to Adversarial Attacks?

MAIA: An Inpainting-Based Approach for Music Adversarial Attacks

Impact of Phonetics on Speaker Identity in Adversarial Voice Attack

Over-the-air White-box Attack on the Wav2Vec Speech Recognition Neural Network