attack 2026

DAST: A Dual-Stream Voice Anonymization Attacker with Staged Training

Ridwan Arefeen 1, Xiaoxiao Miao 2, Rong Tong 1, Aik Beng Ng 3, Simon See 3, Timothy Liu 3

Published on arXiv

2603.12840

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Surpasses state-of-the-art attackers in EER when fine-tuning on only 10% of target anonymization data; Stage II alone enables strong cross-system performance without target-specific adaptation

DAST

Novel technique introduced


Voice anonymization masks vocal traits while preserving linguistic content, but the anonymized speech may still leak speaker-specific patterns. To assess and strengthen privacy evaluation, we propose a dual-stream attacker that fuses spectral and self-supervised learning features via parallel encoders with a three-stage training strategy. Stage I establishes foundational speaker-discriminative representations. Stage II leverages the shared identity-transformation characteristics of voice conversion and anonymization, exposing the model to diverse converted speech to build cross-system robustness. Stage III provides lightweight adaptation to target anonymized data. Results on the VoicePrivacy Attacker Challenge (VPAC) dataset demonstrate that Stage II is the primary driver of generalization, enabling strong attacking performance on unseen anonymization datasets. With Stage III, fine-tuning on only 10% of the target anonymization dataset surpasses current state-of-the-art attackers in terms of equal error rate (EER).
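Since the results are reported in EER, a minimal sketch of how that metric can be computed from trial scores may help. This is not the challenge's official scoring tool: the function name `equal_error_rate` and the simple threshold sweep are assumptions made for illustration. For an attacker, a lower EER means speakers are re-identified more reliably.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER from two lists of similarity scores.

    target_scores: scores for same-speaker trials.
    nontarget_scores: scores for different-speaker trials.
    A higher score means "more likely the same speaker".
    (Hypothetical helper, not the official VPAC scoring script.)
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("both score lists must be non-empty")

    # Every observed score is a candidate decision threshold.
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap = None
    eer = None
    for t in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2
    return eer
```

The sweep only approximates the crossing point of the two error curves; production scoring tools typically interpolate between thresholds.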


Key Contributions

  • Dual-stream architecture fusing spectral (Fbank) and self-supervised learning (WavLM) features via parallel ECAPA-TDNN encoders
  • Three-stage training strategy: Stage I on original speech, Stage II on diverse voice-converted data for cross-system robustness, Stage III for lightweight target-specific adaptation
  • Demonstrates that Stage II (training on voice conversion data) is the primary driver of generalization to unseen anonymization systems
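The dual-stream idea in the first contribution can be sketched at the embedding level. The summary does not say how the two streams are fused, so the concatenation-based fusion below is an assumption, as are the helper names `l2_normalize`, `fuse_embeddings`, and `cosine_score`. The inputs stand in for utterance-level embeddings that the spectral (Fbank) and SSL (WavLM) encoders would produce.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def fuse_embeddings(spectral_emb, ssl_emb):
    """Fuse two stream embeddings into one speaker embedding.

    ASSUMPTION: fusion by concatenation. Each stream is normalized
    first so that neither dominates purely because of its scale.
    """
    fused = l2_normalize(spectral_emb) + l2_normalize(ssl_emb)
    return l2_normalize(fused)


def cosine_score(emb_a, emb_b):
    """Cosine similarity between two unit-length embeddings."""
    return sum(a * b for a, b in zip(emb_a, emb_b))


# Toy vectors standing in for real encoder outputs.
enrol = fuse_embeddings([3.0, 4.0], [1.0, 0.0])
trial = fuse_embeddings([3.0, 4.0], [0.0, 1.0])
score = cosine_score(enrol, trial)
```

In this toy case the two utterances agree in the spectral stream and disagree in the SSL stream, so the fused score falls halfway between a perfect match and no match.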

🛡️ Threat Analysis

Input Manipulation Attack

The paper proposes an attacker system designed to defeat voice anonymization defenses at inference time by re-identifying speakers from anonymized audio — this is an evasion attack against a privacy-preserving transformation.


Details

Domains
audio
Model Types
cnn, transformer
Threat Tags
inference_time, black_box
Datasets
VoicePrivacy Attacker Challenge (VPAC)
Applications
speaker verification, voice anonymization evaluation