DUAP: Dual-task Universal Adversarial Perturbations Against Voice Control Systems

Modern Voice Control Systems (VCS) rely on the collaboration of Automatic Speech Recognition (ASR) and Speaker Recognition (SR) for secure interaction. However, prior adversarial attacks typically target these tasks in isolation, overlooking the coupled decision pipeline in real-world scenarios. Consequently, single-task attacks often fail to pose a practical threat. To fill this gap, we first utilize gradient analysis to reveal that ASR and SR exhibit no inherent conflicts. Building on this, we propose Dual-task Universal Adversarial Perturbation (DUAP). Specifically, DUAP employs a targeted surrogate objective to effectively disrupt ASR transcription and introduces a Dynamic Normalized Ensemble (DNE) strategy to enhance transferability across diverse SR models. Furthermore, we incorporate psychoacoustic masking to ensure perturbation imperceptibility. Extensive evaluations across five ASR and six SR models demonstrate that DUAP achieves high simultaneous attack success rates and superior imperceptibility, significantly outperforming existing single-task baselines.
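The abstract does not say which measure the gradient analysis uses. One common way to test whether two objectives conflict is the cosine similarity between their loss gradients with respect to the shared perturbation. The sketch below is illustrative only: the function names, the zero threshold, and the toy vectors are assumptions, not details from the paper.

```python
import math


def cosine_similarity(g_a, g_b):
    """Cosine of the angle between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g_a, g_b))
    norm_a = math.sqrt(sum(a * a for a in g_a))
    norm_b = math.sqrt(sum(b * b for b in g_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero gradient carries no direction
    return dot / (norm_a * norm_b)


def gradients_conflict(g_asr, g_sr, threshold=0.0):
    """Hypothetical criterion: the tasks conflict when their gradients
    point in opposing directions (cosine similarity below the threshold)."""
    return cosine_similarity(g_asr, g_sr) < threshold


# Toy gradients standing in for d(loss_ASR)/d(delta) and d(loss_SR)/d(delta).
g_asr = [1.0, 0.0, 0.0]
g_sr = [0.0, 1.0, 0.0]
orthogonal = cosine_similarity(g_asr, g_sr)
```

Under this assumed criterion, near-orthogonal or positively aligned gradients mean a step that helps one task does not undo progress on the other, which is the condition a joint perturbation would need.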

Key Contributions

Gradient and mutual information analysis revealing that ASR and SR optimization objectives exhibit no inherent conflicts, enabling simultaneous dual-task adversarial attacks
DUAP attack framework combining a targeted surrogate objective for ASR disruption with a Dynamic Normalized Ensemble (DNE) strategy to enhance cross-model transferability for SR
Psychoacoustic masking constraint to ensure imperceptibility, validated via SNR and MOS across five ASR and six SR models including commercial APIs (Tencent, Alibaba, iFlytek)
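The summary names the Dynamic Normalized Ensemble (DNE) strategy without giving its formula. As a rough illustration of the general idea of a normalized, dynamically weighted ensemble, the sketch below rescales each surrogate model's gradient to unit length and weights it by a softmax over the current per-model losses. Both the weighting rule and every function name here are assumptions made for illustration and should not be read as the paper's DNE definition.

```python
import math


def l2_normalize(grad, eps=1e-12):
    """Scale a gradient to (approximately) unit L2 norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    return [g / (norm + eps) for g in grad]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_ensemble_gradient(grads, losses):
    """Combine per-model gradients after normalization.

    Assumed weighting: models with a higher current loss get more weight,
    so no single surrogate dominates through gradient magnitude alone.
    """
    weights = softmax(losses)
    dim = len(grads[0])
    combined = [0.0] * dim
    for w, grad in zip(weights, grads):
        unit = l2_normalize(grad)
        for i in range(dim):
            combined[i] += w * unit[i]
    return combined


# Two toy SR surrogates whose raw gradients differ in scale by 10^4.
grads = [[100.0, 0.0], [0.0, 0.01]]
losses = [1.0, 1.0]
combined = normalized_ensemble_gradient(grads, losses)
```

With equal losses, both toy models contribute equally despite the large gap in raw gradient magnitude, which is the effect normalization is meant to provide in an ensemble attack.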

🛡️ Threat Analysis

Input Manipulation Attack

Proposes gradient-based universal adversarial perturbations crafted to simultaneously cause mis-transcription in ASR models and identity spoofing in speaker recognition models at inference time. This is a core adversarial-example attack with both white-box optimization and black-box transferability components.

Details

Domains

audio

Model Types

transformer

Threat Tags

white_box, black_box, inference_time, targeted, digital

Applications

Similar Papers

MORE: Multi-Objective Adversarial Attacks on Speech Recognition

Spectral Masking and Interpolation Attack (SMIA): A Black-box Adversarial Attack against Voice Authentication and Anti-Spoofing Systems

An Effective Energy Mask-based Adversarial Evasion Attacks against Misclassification in Speaker Recognition Systems

Mirage Fools the Ear, Mute Hides the Truth: Precise Targeted Adversarial Attacks on Polyphonic Sound Event Detection Systems

MAIA: An Inpainting-Based Approach for Music Adversarial Attacks

Over-the-air White-box Attack on the Wav2Vec Speech Recognition Neural Network

Discrete optimal transport is a strong audio adversarial attack

Are Modern Speech Enhancement Systems Vulnerable to Adversarial Attacks?