Discrete optimal transport is a strong audio adversarial attack
Anton Selitskiy¹, Akib Shahriyar², Jishnuraj Prakasan²
Published on arXiv: 2509.14959
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
OT-based distributional alignment of WavLM embeddings achieves a consistently high equal error rate (EER) against AASIST anti-spoofing countermeasures (CMs) across datasets and remains competitive after CM fine-tuning, outperforming the kNN-VC, SinkVC, and MKL baselines.
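For readers unfamiliar with the metric: the equal error rate is the operating point at which a detector's false acceptance rate equals its false rejection rate, so a higher EER means the countermeasure is worse at separating spoofed from bona fide speech. The sketch below shows one simple way to estimate it from two sets of scores. The function name `equal_error_rate` and the convention that a higher score means "more bona fide" are assumptions made for illustration; the paper's evaluation code may differ.

```python
import numpy as np

def equal_error_rate(bonafide_scores, spoof_scores):
    """Estimate the EER, assuming a higher score means 'more bona fide'."""
    bonafide_scores = np.asarray(bonafide_scores, dtype=float)
    spoof_scores = np.asarray(spoof_scores, dtype=float)
    # Candidate thresholds: every distinct observed score (np.unique sorts them).
    thresholds = np.unique(np.concatenate([bonafide_scores, spoof_scores]))
    # False rejection rate: bona fide trials scored below the threshold.
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])
    # False acceptance rate: spoofed trials scored at or above the threshold.
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    # Take the threshold where the two error rates are closest and average them.
    i = np.argmin(np.abs(far - frr))
    return 0.5 * (far[i] + frr[i])

# Hypothetical scores: a detector that separates the classes perfectly,
# and one whose spoof scores look exactly like its bona fide scores.
separable = equal_error_rate([0.9, 0.8], [0.1, 0.2])
fooled = equal_error_rate([0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.4])
```

Under this convention, a successful attack moves the spoof scores into the bona fide range and pushes the EER from near 0 toward 0.5.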
kDOT-VC
Novel technique introduced
In this paper, we introduce the discrete optimal transport voice conversion ($k$DOT-VC) method. Comparison with $k$NN-VC, SinkVC, and Gaussian optimal transport (MKL) demonstrates stronger domain adaptation abilities of our method. We use the probabilistic nature of optimal transport (OT) and show that $k$DOT-VC is an effective black-box adversarial attack against modern audio anti-spoofing countermeasures (CMs). Our attack operates as a post-processing, distribution-alignment step: frame-level WavLM embeddings of generated speech are aligned to an unpaired bona fide pool via entropic OT and a top-$k$ barycentric projection, then decoded with a neural vocoder. Ablation analysis indicates that distribution-level alignment is a powerful and stable attack for deployed CMs.
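The alignment step described in the abstract can be illustrated with a small NumPy sketch: Sinkhorn iterations give an entropic OT plan between source frames and the target pool, and each source frame is then replaced by a weighted mean of the $k$ target frames that receive the most mass from it. This is not the authors' implementation. The cosine cost, uniform marginals, the values of `eps` and `k`, and all function names are assumptions chosen for illustration; random arrays stand in for WavLM embeddings, and the vocoder stage is omitted.

```python
import numpy as np

def cosine_cost(x, y):
    """Pairwise cosine distance between rows of x and y (assumed cost)."""
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    yn = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - xn @ yn.T

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic OT plan between uniform marginals via Sinkhorn iterations."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)    # uniform mass on source frames (assumption)
    b = np.full(m, 1.0 / m)    # uniform mass on target-pool frames (assumption)
    K = np.exp(-cost / eps)    # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)        # rescale rows toward marginal a
        v = b / (K.T @ u)      # rescale columns to match marginal b
    return u[:, None] * K * v[None, :]

def topk_barycentric_projection(plan, target, k=4):
    """Map each source frame to a weighted mean of its k heaviest targets."""
    # Indices of the k largest plan entries in each row (order within k is arbitrary).
    idx = np.argpartition(-plan, k - 1, axis=1)[:, :k]
    w = np.take_along_axis(plan, idx, axis=1)
    w = w / w.sum(axis=1, keepdims=True)   # renormalise over the kept entries
    return np.einsum("nk,nkd->nd", w, target[idx])

# Hypothetical stand-ins for frame-level embeddings (frames x feature dim).
rng = np.random.default_rng(0)
generated = rng.normal(size=(50, 16))      # frames of generated speech
bona_fide_pool = rng.normal(size=(80, 16)) # unpaired bona fide frames

plan = sinkhorn_plan(cosine_cost(generated, bona_fide_pool))
aligned = topk_barycentric_projection(plan, bona_fide_pool, k=4)  # shape (50, 16)
```

Restricting the projection to the top $k$ entries keeps each output frame a convex combination of a few real target frames, rather than an average over the whole pool; with `k=1` the mapping reduces to picking the single heaviest match under the plan.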
Key Contributions
- Introduces kDOT-VC: entropic discrete OT with top-k barycentric projection for WavLM-based voice conversion that outperforms kNN-VC, SinkVC, and Gaussian OT (MKL) in domain adaptation fidelity.
- Demonstrates that distribution-level alignment via OT constitutes a powerful black-box adversarial attack against modern audio anti-spoofing CMs (AASIST), achieving strong cross-dataset transfer without CM gradient access.
- Provides an ablation analysis that isolates the contribution of distributional shift (vs. speaker identity transfer) to attack effectiveness against deployed countermeasures.
Threat Analysis
kDOT-VC is an inference-time adversarial attack that crafts audio inputs (via OT-based distributional alignment in WavLM embedding space) to cause misclassification by deployed anti-spoofing countermeasure models. No gradient access is needed; the attack shifts the generated audio distribution toward bona fide regions to fool the CM.