Impact of Phonetics on Speaker Identity in Adversarial Voice Attack

Daniyal Kabir Dar, Qiben Yan, Li Xiao, Arun Ross

Published on arXiv: 2509.15437

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Targeted adversarial examples against DeepSpeech induce systematic phoneme confusions and measurable identity drift in speaker embeddings (ECAPA-TDNN and ResNet) across all 16 test phrases, demonstrating a dual compromise of ASR transcription and biometric identity.

Identity Drift

Novel technique introduced


Adversarial perturbations in speech pose a serious threat to automatic speech recognition (ASR) and speaker verification by introducing subtle waveform modifications that remain imperceptible to humans but can significantly alter system outputs. While targeted attacks on end-to-end ASR models have been widely studied, the phonetic basis of these perturbations and their effect on speaker identity remain underexplored. In this work, we analyze adversarial audio at the phonetic level and show that perturbations exploit systematic confusions such as vowel centralization and consonant substitutions. These distortions not only mislead transcription but also degrade phonetic cues critical for speaker verification, leading to identity drift. Using DeepSpeech as our ASR target, we generate targeted adversarial examples and evaluate their impact on speaker embeddings across genuine and impostor samples. Results across 16 phonetically diverse target phrases demonstrate that adversarial audio induces both transcription errors and identity drift, highlighting the need for phonetic-aware defenses to ensure the robustness of ASR and speaker recognition systems.
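The targeted attack described above can be sketched as an optimization over an additive waveform perturbation that drives a CTC model toward an attacker-chosen transcription while staying inside a small L-infinity budget. A minimal, hedged sketch in PyTorch follows; the tiny GRU model is a stand-in (the paper attacks DeepSpeech), and the class ids, budget `eps`, and step count are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in CTC acoustic model: a tiny GRU over raw waveform frames.
# (Assumption: DeepSpeech is not bundled here; any CTC model exposing
# per-frame log-probabilities would slot into the same loop.)
class TinyCTCModel(nn.Module):
    def __init__(self, n_classes=29, frame=160):
        super().__init__()
        self.frame = frame
        self.rnn = nn.GRU(frame, 64, batch_first=True)
        self.head = nn.Linear(64, n_classes)

    def forward(self, wav):                         # wav: (B, samples)
        frames = wav.unfold(1, self.frame, self.frame)  # (B, T, frame)
        h, _ = self.rnn(frames)
        return self.head(h).log_softmax(-1)         # (B, T, classes)

model = TinyCTCModel()
ctc = nn.CTCLoss(blank=0)

wav = torch.randn(1, 16000)                  # 1 s of placeholder "audio"
target = torch.tensor([[8, 5, 12, 12, 15]])  # hypothetical target ids
delta = torch.zeros_like(wav, requires_grad=True)
opt = torch.optim.Adam([delta], lr=1e-3)
eps = 0.01                                   # imperceptibility budget (L-inf)

for _ in range(50):
    logp = model(wav + delta).transpose(0, 1)  # (T, B, C) for CTCLoss
    loss = ctc(logp, target,
               torch.full((1,), logp.size(0), dtype=torch.long),
               torch.tensor([target.size(1)]))
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        delta.clamp_(-eps, eps)              # project back into the budget

adv = wav + delta.detach()                   # adversarial waveform
```

The clamp step is what keeps the perturbation "subtle"; the paper's analysis concerns what such constrained perturbations do at the phonetic level, not the attack loop itself.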


Key Contributions

  • Phonetic-level analysis of adversarial audio revealing systematic confusions (vowel centralization, consonant substitutions) underlying transcription errors
  • Introduction of 'identity drift' as a characterization of how adversarial perturbations degrade speaker embedding similarity in verification systems
  • Experimental demonstration that a single adversarial perturbation simultaneously compromises ASR transcription accuracy and speaker biometric identity across 16 phonetically diverse target phrases
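The phoneme-confusion analysis in the first contribution can be illustrated by aligning the phoneme sequence of the clean transcript against the adversarial one and tallying substitutions. A minimal sketch follows; the ARPAbet strings and the `align_substitutions` helper are illustrative assumptions (in practice a G2P tool or forced aligner would produce the phoneme sequences).

```python
def align_substitutions(ref, hyp):
    """Levenshtein-align two phoneme lists; return (ref, hyp) substitution pairs."""
    m, n = len(ref), len(hyp)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)  # match/substitution
    # Backtrace, collecting substitution pairs only.
    subs, i, j = [], m, n
    while i > 0 and j > 0:
        if dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            if ref[i - 1] != hyp[j - 1]:
                subs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return subs[::-1]

clean = ["HH", "AE", "L", "OW"]   # hypothetical clean phonemes
adv   = ["HH", "AH", "L", "OW"]   # vowel centralized under attack
print(align_substitutions(clean, adv))  # [('AE', 'AH')]
```

Aggregating such pairs over many utterances is one way to surface the systematic patterns (e.g. vowels collapsing toward a central vowel) that the paper reports.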

🛡️ Threat Analysis

Input Manipulation Attack

Generates targeted adversarial waveform perturbations against DeepSpeech ASR and evaluates their impact on speaker verification embeddings — inference-time input manipulation causing misclassification and biometric degradation.
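The biometric side of the attack is quantified by comparing speaker embeddings of the clean and adversarial versions of the same utterance. A minimal sketch of that drift metric follows, assuming a fixed random-projection embedder as a stand-in for the ECAPA-TDNN or ResNet embedders used in the paper; the `drift` definition (one minus cosine similarity) is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in speaker embedder: fixed random projection of a log-magnitude
# spectrum. (Assumption: any waveform -> fixed-dim vector mapping works
# for illustrating the metric; the paper uses ECAPA-TDNN and ResNet.)
W = rng.standard_normal((192, 512))

def embed(wav):
    spec = np.abs(np.fft.rfft(wav, n=1022))   # 512 spectral bins
    return W @ np.log1p(spec)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

clean = rng.standard_normal(16000)                 # placeholder utterance
adv = clean + 0.01 * rng.standard_normal(16000)    # perturbed version

# Identity drift: drop in cosine similarity between clean and adversarial
# embeddings of the same utterance; larger drift = weaker speaker match.
drift = 1.0 - cosine(embed(clean), embed(adv))
print(f"identity drift: {drift:.4f}")
```

Evaluating this drift against genuine and impostor score distributions is what turns it into a verification-level threat measure rather than just an embedding-space curiosity.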


Details

Domains
audio
Model Types
rnn, cnn, transformer
Threat Tags
white_box, inference_time, targeted, digital
Applications
automatic speech recognition, speaker verification, voice authentication