E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis
Zhisheng Zhang 1,2, Derui Wang 3, Yifan Mi 2, Zhiyong Wu 1, Jie Gao 1, Yuxin Cao 4, Kai Ye 5, Minhui Xue 3,6, Jie Hao 2
Published on arXiv (arXiv:2511.07099)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
E2E-VGuard successfully protects timbre and pronunciation against 16 open-source synthesizers and 3 commercial APIs across Chinese and English, validated in real-world deployment.
E2E-VGuard
Novel technique introduced
Recent advancements in speech synthesis have enriched our daily lives, with high-quality, human-like audio widely adopted across real-world applications. However, malicious exploitation such as voice-cloning fraud poses severe security risks. Existing defense techniques struggle to counter production large language model (LLM)-based speech synthesis. While previous studies have considered protection against fine-tuned synthesizers, they assume manually annotated transcripts. Given the labor intensity of manual annotation, end-to-end (E2E) systems that leverage automatic speech recognition (ASR) to generate transcripts are becoming increasingly prevalent, e.g., voice cloning via commercial APIs. Such E2E speech synthesis therefore requires new security mechanisms. To tackle these challenges, we propose E2E-VGuard, a proactive defense framework against two emerging threats: (1) production LLM-based speech synthesis, and (2) the novel attack arising from ASR-driven E2E scenarios. Specifically, we employ an encoder ensemble with a feature extractor to protect timbre, while ASR-targeted adversarial examples disrupt pronunciation. Moreover, we incorporate a psychoacoustic model to ensure the imperceptibility of the perturbation. For a comprehensive evaluation, we test 16 open-source synthesizers and 3 commercial APIs across Chinese and English datasets, confirming E2E-VGuard's effectiveness in timbre and pronunciation protection. Real-world deployment validation is also conducted. Our code and demo page are available at https://wxzyd123.github.io/e2e-vguard/.
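The timbre-protection idea in the abstract (an encoder ensemble attacked under a perturbation budget) can be sketched with a toy projected-gradient ascent. This is a minimal illustration, not the paper's implementation: the random linear `encoders` stand in for neural speaker encoders so that gradients stay analytic, and all hyperparameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for speaker encoders: random linear embeddings.
# A real system would ensemble neural speaker encoders behind a shared
# feature extractor; linear maps keep the gradient analytic here.
encoders = [rng.standard_normal((32, 256)) / 16.0 for _ in range(3)]

def ensemble_loss(x_adv, x_ref):
    # Total squared embedding distance across the encoder ensemble.
    return sum(float(np.sum((W @ x_adv - W @ x_ref) ** 2)) for W in encoders)

def protect_timbre(x, eps=0.05, alpha=0.005, steps=50):
    """PGD-style sign ascent: push every encoder's embedding of the
    protected audio away from the original speaker embedding, under an
    L-infinity budget eps (hyperparameters are illustrative)."""
    delta = rng.uniform(-eps / 10, eps / 10, size=x.shape)  # random start
    for _ in range(steps):
        # Analytic gradient: d/d_delta ||W(x+delta) - W x||^2 = 2 W^T W delta
        grad = sum(2.0 * W.T @ (W @ (x + delta) - W @ x) for W in encoders)
        delta = np.clip(delta + alpha * np.sign(grad), -eps, eps)
    return x + delta

x = rng.standard_normal(256) * 0.1   # stand-in waveform window
x_prot = protect_timbre(x)
```

The clip step enforces the imperceptibility budget, while the sign ascent maximizes the embedding distance jointly over all encoders, the standard way an ensemble attack improves transfer to unseen synthesizers.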
Key Contributions
- Encoder ensemble with feature extractor to craft adversarial perturbations that protect timbre from LLM-based speech synthesizers
- ASR-targeted adversarial examples that disrupt pronunciation in end-to-end voice cloning pipelines relying on automatic transcription
- Psychoacoustic model integration to ensure perturbations are imperceptible to human listeners
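The second bullet (disrupting the ASR stage of E2E cloning pipelines) can be illustrated with untargeted gradient ascent on a toy per-frame classifier. This is a hypothetical stand-in for an ASR acoustic model, assumed here only for the sketch; real ASR attacks operate on sequence models (e.g., CTC losses), and all names and hyperparameters below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical per-frame phone classifier standing in for an ASR
# acoustic model (real pipelines use sequence models with CTC loss).
W = rng.standard_normal((10, 64)) / 8.0

def ce_loss(frames, labels):
    # Mean cross-entropy of the correct per-frame labels.
    return float(np.mean([-np.log(softmax(W @ f)[y] + 1e-12)
                          for f, y in zip(frames, labels)]))

def disrupt_pronunciation(frames, labels, eps=0.3, alpha=0.02, steps=40):
    """Untargeted sign-gradient ascent: raise the cross-entropy of the
    correct labels so the ASR stage mis-transcribes the protected audio."""
    delta = np.zeros_like(frames)
    for _ in range(steps):
        grads = np.empty_like(frames)
        for t, (f, y) in enumerate(zip(frames + delta, labels)):
            p = softmax(W @ f)
            p[y] -= 1.0              # softmax minus one-hot
            grads[t] = W.T @ p       # dCE/dframe for a linear classifier
        delta = np.clip(delta + alpha * np.sign(grads), -eps, eps)
    return frames + delta

frames = rng.standard_normal((20, 64))
labels = np.array([int(np.argmax(W @ f)) for f in frames])  # clean "transcript"
frames_adv = disrupt_pronunciation(frames, labels)
```

Because the E2E pipeline trusts its own ASR output, corrupting the transcription propagates downstream: the synthesizer is fine-tuned on wrong text-audio pairs, degrading the clone's pronunciation.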
🛡️ Threat Analysis
The core technical contribution is the adversarial-example methodology: gradient-based perturbations, constrained by a psychoacoustic model, that fool speech synthesis pipeline components at inference time. Encoder-ensemble-targeted examples protect timbre, while ASR-targeted adversarial examples disrupt pronunciation.
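The psychoacoustic constraint mentioned above can be approximated with a crude frequency-domain masking proxy: perturbation energy is allowed only some margin below the clean signal's magnitude in each bin, so it hides under louder content. This is an illustrative simplification, not the paper's psychoacoustic model; the function name and `margin_db` default are assumptions.

```python
import numpy as np

def mask_constrain(x_clean, delta, margin_db=-20.0):
    """Crude frequency-domain masking proxy (illustrative; a proper
    psychoacoustic model computes per-band masking thresholds): cap
    each frequency bin of the perturbation at margin_db below the
    clean signal's magnitude in that bin."""
    X = np.fft.rfft(x_clean)
    D = np.fft.rfft(delta)
    ceiling = np.abs(X) * 10.0 ** (margin_db / 20.0)   # per-bin threshold
    mag = np.abs(D)
    scale = np.where(mag > ceiling, ceiling / (mag + 1e-12), 1.0)
    # Real, nonnegative scaling preserves conjugate symmetry,
    # so the inverse transform stays real-valued.
    return np.fft.irfft(D * scale, n=len(delta))
```

In an attack loop, a projection like this would run after each gradient step, trading a little adversarial strength for imperceptibility to human listeners.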
The application goal is protecting audio content integrity against voice-cloning fraud (deepfake audio generation). Anti-voice-cloning perturbations are the audio analog of anti-deepfake image perturbations, which fall under output integrity and content authenticity protection in ML09.