Spectral Masking and Interpolation Attack (SMIA): A Black-box Adversarial Attack against Voice Authentication and Anti-Spoofing Systems

Kamel Kamel, Hridoy Sankar Dutta, Keshav Sood, Sunil Aryal

Published on arXiv (arXiv:2509.07677)

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

SMIA achieves at least 82% attack success rate against combined voice authentication and anti-spoofing systems and 100% against standalone anti-spoofing countermeasures in black-box settings

SMIA (Spectral Masking and Interpolation Attack)

Novel technique introduced


Voice Authentication Systems (VAS) use unique vocal characteristics for verification. They are increasingly integrated into high-security sectors such as banking and healthcare. Despite their improvements using deep learning, they face severe vulnerabilities from sophisticated threats like deepfakes and adversarial attacks. The emergence of realistic voice cloning complicates detection, as systems struggle to distinguish authentic from synthetic audio. While anti-spoofing countermeasures (CMs) exist to mitigate these risks, many rely on static detection models that can be bypassed by novel adversarial methods, leaving a critical security gap. To demonstrate this vulnerability, we propose the Spectral Masking and Interpolation Attack (SMIA), a novel method that strategically manipulates inaudible frequency regions of AI-generated audio. By altering the voice in imperceptible zones to the human ear, SMIA creates adversarial samples that sound authentic while deceiving CMs. We conducted a comprehensive evaluation of our attack against state-of-the-art (SOTA) models across multiple tasks, under simulated real-world conditions. SMIA achieved a strong attack success rate (ASR) of at least 82% against combined VAS/CM systems, at least 97.5% against standalone speaker verification systems, and 100% against countermeasures. These findings conclusively demonstrate that current security postures are insufficient against adaptive adversarial attacks. This work highlights the urgent need for a paradigm shift toward next-generation defenses that employ dynamic, context-aware frameworks capable of evolving with the threat landscape.


Key Contributions

  • Novel spectral masking and interpolation technique that perturbs inaudible frequency regions of AI-generated audio to produce adversarial samples that are indistinguishable to human listeners
  • Comprehensive black-box attack achieving ≥82% ASR against combined VAS/CM pipelines, ≥97.5% against standalone speaker verification, and 100% against standalone anti-spoofing countermeasures
  • Evaluation under simulated real-world conditions across SOTA anti-spoofing and speaker verification systems including commercial APIs (Microsoft Azure)

🛡️ Threat Analysis

Input Manipulation Attack

SMIA crafts adversarial audio inputs by strategically manipulating inaudible frequency regions to cause misclassification by speaker verification and anti-spoofing ML models at inference time — a classic inference-time evasion attack on ML classifiers, applied to the audio domain.


Details

Domains
audio
Model Types
cnn, transformer
Threat Tags
black_box, inference_time, targeted, digital
Datasets
ASVspoof 2019, LibriSpeech
Applications
speaker verification, voice authentication, anti-spoofing countermeasures