Discrete optimal transport is a strong audio adversarial attack

In this paper, we introduce the discrete optimal transport voice conversion ($k$DOT-VC) method. In comparisons with $k$NN-VC, SinkVC, and Gaussian optimal transport (MKL), our method shows stronger domain adaptation. Exploiting the probabilistic nature of optimal transport (OT), we show that $k$DOT-VC is an effective black-box adversarial attack against modern audio anti-spoofing countermeasures (CMs). Our attack operates as a post-processing distribution-alignment step: frame-level WavLM embeddings of generated speech are aligned to an unpaired bona fide pool via entropic OT and a top-$k$ barycentric projection, then decoded with a neural vocoder. Ablation analysis indicates that distribution-level alignment is a powerful and stable attack against deployed CMs.
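The alignment step described above (entropic OT followed by a top-$k$ barycentric projection) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The cosine cost, the uniform marginals, the regularization strength `eps`, the value of `k`, and all function names are assumptions made for illustration. Random arrays stand in for frame-level WavLM embeddings, and the vocoder stage is omitted.

```python
import numpy as np


def cosine_cost(x, y):
    """Pairwise cosine distance between rows of x (n, d) and y (m, d)."""
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    yn = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - xn @ yn.T


def _logsumexp(x, axis):
    """Numerically stable log-sum-exp along one axis."""
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic OT plan between uniform marginals (log-domain Sinkhorn).

    Uniform marginals and the default eps are assumptions of this sketch.
    """
    n, m = cost.shape
    log_a = np.full(n, -np.log(n))  # uniform mass on source frames
    log_b = np.full(m, -np.log(m))  # uniform mass on target frames
    f = np.zeros(n)
    g = np.zeros(m)
    for _ in range(n_iters):
        # Alternate dual updates so the row, then column, marginals match.
        f = eps * (log_a - _logsumexp((g[None, :] - cost) / eps, axis=1))
        g = eps * (log_b - _logsumexp((f[:, None] - cost) / eps, axis=0))
    return np.exp((f[:, None] + g[None, :] - cost) / eps)


def topk_barycentric_projection(plan, target, k=4):
    """Map each source frame to a weighted mean of its k strongest couplings."""
    # Indices of the k largest plan entries in each row (order within k is arbitrary).
    idx = np.argpartition(-plan, k - 1, axis=1)[:, :k]
    w = np.take_along_axis(plan, idx, axis=1)
    w = w / w.sum(axis=1, keepdims=True)  # renormalize the kept weights
    return np.einsum("nk,nkd->nd", w, target[idx])


# Random stand-ins for embeddings of generated speech and a bona fide pool.
rng = np.random.default_rng(0)
generated = rng.normal(size=(50, 16))
bona_fide_pool = rng.normal(size=(80, 16))

plan = sinkhorn_plan(cosine_cost(generated, bona_fide_pool))
aligned = topk_barycentric_projection(plan, bona_fide_pool, k=4)
print(aligned.shape)
```

Each aligned frame is a convex combination of pool frames, so it stays inside the region spanned by the bona fide embeddings. With `k=1` the projection reduces to picking the single most strongly coupled pool frame for each source frame.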

Key Contributions

Introduces kDOT-VC: entropic discrete OT with top-k barycentric projection for WavLM-based voice conversion that outperforms kNN-VC, SinkVC, and Gaussian OT (MKL) in domain adaptation fidelity.
Demonstrates that distribution-level alignment via OT constitutes a powerful black-box adversarial attack against modern audio anti-spoofing CMs (AASIST), achieving strong cross-dataset transfer without CM gradient access.
Provides an ablation analysis that isolates the contribution of distributional shift (vs. speaker identity transfer) to attack effectiveness against deployed countermeasures.

🛡️ Threat Analysis

Input Manipulation Attack

kDOT-VC is an inference-time adversarial attack that crafts audio inputs (via OT-based distributional alignment in WavLM embedding space) to cause misclassification by deployed anti-spoofing countermeasure models. No gradient access is needed; the attack shifts the generated audio distribution toward bona fide regions to fool the CM.