Defense · 2025

E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis

Zhisheng Zhang 1,2, Derui Wang 3, Yifan Mi 2, Zhiyong Wu 1, Jie Gao 1, Yuxin Cao 4, Kai Ye 5, Minhui Xue 3,6, Jie Hao 2


Published on arXiv: 2511.07099

Input Manipulation Attack (OWASP ML Top 10: ML01)

Output Integrity Attack (OWASP ML Top 10: ML09)

Key Finding

E2E-VGuard successfully protects timbre and pronunciation against 16 open-source synthesizers and 3 commercial APIs across Chinese and English, validated in real-world deployment.

E2E-VGuard

Novel technique introduced


Recent advances in speech synthesis have enriched our daily lives, with high-quality, human-like audio widely adopted across real-world applications. However, malicious exploitation such as voice-cloning fraud poses severe security risks. Existing defense techniques struggle to address production large language model (LLM)-based speech synthesis. While previous studies have considered protection against fine-tuned synthesizers, they assume manually annotated transcripts. Given the labor intensity of manual annotation, end-to-end (E2E) systems that leverage automatic speech recognition (ASR) to generate transcripts are becoming increasingly prevalent, e.g., voice cloning via commercial APIs; E2E speech synthesis therefore also requires new security mechanisms. To tackle these challenges, we propose E2E-VGuard, a proactive defense framework against two emerging threats: (1) production LLM-based speech synthesis, and (2) the novel attack arising from ASR-driven E2E scenarios. Specifically, we employ an encoder ensemble with a feature extractor to protect timbre, while ASR-targeted adversarial examples disrupt pronunciation. Moreover, we incorporate a psychoacoustic model to keep the perturbations imperceptible. For a comprehensive evaluation, we test 16 open-source synthesizers and 3 commercial APIs on Chinese and English datasets, confirming E2E-VGuard's effectiveness in protecting timbre and pronunciation, and we further validate it in a real-world deployment. Our code and demo page are available at https://wxzyd123.github.io/e2e-vguard/.


Key Contributions

  • Encoder ensemble with feature extractor to craft adversarial perturbations that protect timbre from LLM-based speech synthesizers
  • ASR-targeted adversarial examples that disrupt pronunciation in end-to-end voice cloning pipelines relying on automatic transcription
  • Psychoacoustic model integration to ensure perturbations are imperceptible to human listeners
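To make the first contribution concrete, here is a minimal PGD-style sketch of ensemble-based timbre protection. It is not the paper's implementation: random linear maps stand in for pretrained speaker encoders, and the loss, step size, and budget are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 3-encoder ensemble; in practice these would be neural speaker encoders.
encoders = [rng.standard_normal((8, 64)) for _ in range(3)]

def ensemble_loss(x, x_orig):
    # Mean squared embedding distance from the clean utterance over the ensemble.
    return float(np.mean([np.sum((W @ x - W @ x_orig) ** 2) for W in encoders]))

def protect(x_orig, eps=0.05, alpha=0.01, steps=50):
    """Ascend the ensemble loss under an L-infinity perturbation budget eps."""
    # Random start: the gradient is exactly zero at x_orig itself.
    x = x_orig + rng.uniform(-eps, eps, x_orig.shape)
    for _ in range(steps):
        # Analytic gradient of the loss for the linear toy encoders.
        grad = np.mean([2 * W.T @ (W @ x - W @ x_orig) for W in encoders], axis=0)
        x = x + alpha * np.sign(grad)                 # signed gradient ascent step
        x = x_orig + np.clip(x - x_orig, -eps, eps)   # project back into the budget
    return x

wave = rng.standard_normal(64)   # stand-in for a voice recording
protected = protect(wave)
```

The projection keeps the sample-wise perturbation within the budget while the embeddings drift away from the clean speaker representation, which is the core mechanism by which a small audio change can derail cloning.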

🛡️ Threat Analysis

Input Manipulation Attack

The core technical contribution is crafting adversarial perturbations, specifically encoder-ensemble-targeted examples for timbre protection and ASR-targeted adversarial examples for pronunciation disruption, that fool components of the speech synthesis pipeline at inference time. The methodology is gradient-based and constrained by a psychoacoustic model.
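To illustrate the psychoacoustic constraint, a toy frequency-domain shaping step might look like the following. The masking threshold here is a made-up heuristic (louder bands tolerate more perturbation energy), standing in for the paper's actual psychoacoustic model:

```python
import numpy as np

rng = np.random.default_rng(1)
wave = rng.standard_normal(256)               # stand-in clean waveform
raw_pert = 0.1 * rng.standard_normal(256)     # unconstrained adversarial perturbation

# Hypothetical masking threshold: allow more perturbation in frequency bins
# where the signal itself is loud (a crude model of simultaneous masking).
signal_mag = np.abs(np.fft.rfft(wave))
threshold = 0.2 * signal_mag / signal_mag.max()

# Scale each perturbation bin down to its threshold, then resynthesize.
spec = np.fft.rfft(raw_pert)
scale = np.minimum(1.0, threshold / (np.abs(spec) + 1e-12))
shaped = np.fft.irfft(spec * scale, n=wave.size)

protected = wave + shaped   # perturbation now lies under the (toy) audibility curve
```

The point of the shaping step is that the adversarial energy is redistributed to where the listener is least likely to hear it, so protection strength need not come at the cost of audio quality.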

Output Integrity Attack

The application goal is protecting audio content integrity against voice-cloning fraud (deepfake audio generation). Anti-voice-cloning perturbations are the audio analog of anti-deepfake image perturbations, falling under output integrity and content-authenticity protection (ML09).


Details

Domains
audio, NLP
Model Types
LLM, transformer
Threat Tags
white_box, inference_time, digital
Datasets
Chinese speech datasets, English speech datasets
Applications
voice cloning prevention, speech synthesis protection, commercial TTS APIs