defense 2025

FreeTalk: A plug-and-play and black-box defense against speech synthesis attacks

Yuwen Pu 1, Zhou Feng 2, Chunyi Zhou 2, Jiahao Chen 2, Chunqiang Hu 1, Haibo Hu 1, Shouling Ji 2

0 citations

Published on arXiv

2509.00561

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

FreeTalk achieves effective voice privacy protection against 5 speech synthesis models in a black-box setting while maintaining high speech quality, with a universal perturbation that generalizes to arbitrary-length audio without per-segment computation.

FreeTalk

Novel technique introduced


Speech assistants and speech verification are now used in many fields and bring considerable benefit and convenience. However, speech shared through these applications may be collected by attackers and used for speech synthesis. For example, an attacker who obtains a sample of a victim's speech can generate inappropriate political statements in the victim's voice, seriously damaging the victim's reputation. In particular, zero-shot voice conversion methods have further reduced the cost of speech synthesis attacks, which poses greater challenges to user voice security and privacy. Researchers have proposed privacy-preserving methods in response, but the existing approaches have non-negligible drawbacks: low transferability, low robustness, and high computational overhead. These deficiencies seriously limit their deployment in practical scenarios. In this paper, we therefore propose a lightweight, robust, plug-and-play privacy preservation method against speech synthesis attacks in a black-box setting. Our method generates a frequency-domain perturbation and adds it to the original speech, achieving privacy protection while preserving high speech quality. We then present a data augmentation strategy and a noise smoothing mechanism to improve the robustness of the proposed method. To reduce the user's defense overhead, we also propose a novel identity-wise protection mechanism, which generates one universal perturbation per speaker and supports privacy preservation for speech of any length. Finally, we conduct extensive experiments on 5 speech synthesis models, 5 speech verification models, 1 speech recognition model, and 2 datasets. The experimental results demonstrate that our method achieves satisfactory privacy-preserving performance, high speech quality, and utility.


Key Contributions

  • Lightweight frequency-domain adversarial perturbation method that disrupts voice conversion and TTS models in a fully black-box setting without requiring knowledge of the attacker's model architecture or parameters
  • Universal (identity-wise) perturbation mechanism that generates a single protective perturbation per speaker, supporting audio of arbitrary length without per-segment recomputation
  • Data augmentation strategy and noise smoothing mechanism to improve cross-model transferability and robustness of the protective perturbations
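
The identity-wise idea in the contributions above can be illustrated with a minimal sketch. This is not the paper's implementation: the frame length, the helper names `make_universal_perturbation` and `protect`, the non-overlapping rectangular framing, and the random stand-in for an optimized perturbation are all assumptions made for illustration. The sketch only shows how one fixed frequency-domain perturbation could be reused on every frame, so audio of any length is handled without per-segment recomputation.

```python
import numpy as np

FRAME = 512  # hypothetical frame length in samples (assumption, not from the paper)


def make_universal_perturbation(n_bins, eps, seed=0):
    """Random stand-in for a per-speaker perturbation (a real one would be optimized).

    Returns a complex spectrum whose largest bin magnitude equals eps.
    """
    rng = np.random.default_rng(seed)
    spec = rng.standard_normal(n_bins) + 1j * rng.standard_normal(n_bins)
    spec[0] = spec[0].real    # DC bin of a real signal is real
    spec[-1] = spec[-1].real  # Nyquist bin is real for an even frame length
    return eps * spec / np.max(np.abs(spec))


def protect(audio, delta_spec, frame=FRAME):
    """Add the same spectral perturbation to every non-overlapping frame."""
    n = len(audio)
    padded = np.pad(audio, (0, (-n) % frame))   # zero-pad to a whole number of frames
    frames = padded.reshape(-1, frame)
    spec = np.fft.rfft(frames, axis=1)          # per-frame spectrum
    spec = spec + delta_spec                    # broadcast: one perturbation, all frames
    out = np.fft.irfft(spec, n=frame, axis=1).reshape(-1)[:n]
    return np.clip(out, -1.0, 1.0)              # keep samples in the valid range


# Usage: the same perturbation protects a short and a long clip.
delta = make_universal_perturbation(FRAME // 2 + 1, eps=0.5)
short_clip = 0.1 * np.sin(np.linspace(0, 20, 1000))
long_clip = 0.1 * np.sin(np.linspace(0, 100, 5000))
protected_short = protect(short_clip, delta)
protected_long = protect(long_clip, delta)
```

Non-overlapping rectangular frames keep the sketch short and exactly invertible. A practical system would more likely use a windowed, overlapping STFT and a perturbation optimized against surrogate models, neither of which is shown here.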

🛡️ Threat Analysis

Input Manipulation Attack

FreeTalk's core mechanism is to add adversarial perturbations to audio inputs so that downstream speech synthesis models (VC/TTS) malfunction at inference time. It is a proactive adversarial-perturbation defense that causes the attacker's model to fail at inference. The universal perturbation is analogous to Fawkes/AntiFake-style adversarial input protection, which falls squarely within ML01's defensive scope.
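
One common way to check whether a defense of this kind worked is to compare speaker embeddings of the victim's real speech and of the attacker's synthesized output. The sketch below assumes the embeddings have already been produced by some speaker verification model (not included here); the function names and the `0.5` threshold are hypothetical and are not taken from the paper.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def is_protected(enrolled_emb, synthesized_emb, threshold=0.5):
    """True if the synthesized speech would be rejected as the enrolled speaker.

    threshold is a hypothetical decision threshold; real systems tune it per model.
    """
    return cosine_similarity(enrolled_emb, synthesized_emb) < threshold
```

Under this framing, protection succeeds when speech synthesized from the perturbed audio no longer verifies as the victim, while the perturbed audio itself should still sound natural to listeners.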


Details

Domains
audio
Model Types
transformer
Threat Tags
black_box, inference_time
Datasets
evaluated on 2 datasets across 5 speech synthesis models, 5 speech verification models, and 1 speech recognition model (specific dataset names not disclosed in the excerpt)
Applications
voice conversion, text-to-speech synthesis, speaker verification, speech recognition