
VoxMorph: Scalable Zero-shot Voice Identity Morphing via Disentangled Embeddings

Bharath Krishnamurthy, Ajita Rattani



Published on arXiv (arXiv:2601.20883)

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Achieves a 67.8% morphing attack success rate against automated speaker verification (ASV) systems under strict security thresholds, in a zero-shot setting that requires only five seconds of audio per subject.

VoxMorph

Novel technique introduced


Morphing techniques generate artificial biometric samples that combine features from multiple individuals, allowing each contributor to be verified against a single enrolled template. While extensively studied in face recognition, this vulnerability remains largely unexplored in voice biometrics. Prior work on voice morphing is computationally expensive, non-scalable, and limited to acoustically similar identity pairs, constraining practical deployment. Moreover, existing sound-morphing methods target audio textures, music, or environmental sounds and are not transferable to voice identity manipulation. We propose VoxMorph, a zero-shot framework that produces high-fidelity voice morphs from as little as five seconds of audio per subject without model retraining. Our method disentangles vocal traits into prosody and timbre embeddings, enabling fine-grained interpolation of speaking style and identity. These embeddings are fused via Spherical Linear Interpolation (Slerp) and synthesized using an autoregressive language model coupled with a Conditional Flow Matching network. VoxMorph achieves state-of-the-art performance, delivering a 2.6x gain in audio quality, a 73% reduction in intelligibility errors, and a 67.8% morphing attack success rate on automated speaker verification systems under strict security thresholds. This work establishes a practical and scalable paradigm for voice morphing with significant implications for biometric security. The code and dataset are available on our project page: https://vcbsl.github.io/VoxMorph/


Key Contributions

  • VoxMorph: a zero-shot voice identity morphing framework that disentangles vocal traits into independent prosody and timbre embeddings, enabling fine-grained interpolation without identity-pair-specific fine-tuning
  • Slerp-based fusion of disentangled speaker embeddings paired with an autoregressive LM + Conditional Flow Matching synthesis pipeline, removing the acoustic-similarity constraint of prior work
  • Achieves a 67.8% morphing attack success rate on ASV systems using as little as 5 seconds of audio per subject, with a 2.6x audio quality gain and a 73% reduction in intelligibility errors over the only prior voice identity morphing (VIM) method
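The paper fuses disentangled prosody and timbre embeddings via Spherical Linear Interpolation (Slerp), which interpolates along the great circle between two unit-norm vectors rather than cutting through the sphere's interior as linear blending would. The authors' implementation is on their project page; the sketch below is only an illustrative Slerp on toy embeddings (the function name, vector dimension, and mixing weight are assumptions, not the paper's code):

```python
import numpy as np

def slerp(a: np.ndarray, b: np.ndarray, t: float) -> np.ndarray:
    """Spherical linear interpolation between two embeddings.

    Both inputs are unit-normalized; the result also lies on the
    unit sphere, preserving the embedding norm that linear blending
    would shrink.
    """
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    dot = np.clip(np.dot(a, b), -1.0, 1.0)
    theta = np.arccos(dot)  # angle between the two embeddings
    if theta < 1e-6:
        # Nearly parallel vectors: fall back to linear interpolation
        return (1.0 - t) * a + t * b
    return (np.sin((1.0 - t) * theta) * a + np.sin(t * theta) * b) / np.sin(theta)

# Toy 4-dim "timbre" embeddings for two subjects (illustrative only)
rng = np.random.default_rng(0)
e1, e2 = rng.standard_normal(4), rng.standard_normal(4)
morph = slerp(e1, e2, 0.5)  # equal contribution from both identities
```

At t = 0.5 each contributor's embedding contributes equally, which is the regime a morphing attack targets: the fused embedding should verify against either enrolled template.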

🛡️ Threat Analysis

Input Manipulation Attack

VoxMorph crafts morphed voice inputs via generative synthesis that cause ASV systems to incorrectly verify multiple identities at inference time — a generative evasion attack on ML-based biometric verification models analogous to face morphing attacks on face recognition.


Details

Domains
audio, generative
Model Types
generative, transformer
Threat Tags
black_box, inference_time, targeted
Applications
speaker verification, biometric security, voice identity morphing