defense 2025

ClearMask: Noise-Free and Naturalness-Preserving Protection Against Voice Deepfake Attacks

Yuanda Wang 1, Bocheng Chen 1, Hanqing Guo 2, Guangjing Wang 3, Weikang Ding 1, Qiben Yan 1

0 citations

Published on arXiv

2508.17660

Input Manipulation Attack

OWASP ML Top 10 — ML01

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

ClearMask and LiveMask effectively prevent voice deepfake attacks from deceiving both speaker verification models and human listeners across unseen synthesis models and black-box APIs while maintaining naturalness

ClearMask

Novel technique introduced


Voice deepfake attacks, which artificially impersonate human speech for malicious purposes, have emerged as a severe threat. Existing defenses typically inject noise into human speech to compromise voice encoders in speech synthesis models. However, these methods degrade audio quality and require prior knowledge of the attack approaches, limiting their effectiveness in diverse scenarios. Moreover, real-time audio, such as speech in virtual meetings and voice messages, is still exposed to voice deepfake threats. To overcome these limitations, we propose ClearMask, a noise-free defense mechanism against voice deepfake attacks. Unlike traditional approaches, ClearMask modifies the audio mel-spectrogram by selectively filtering certain frequencies, inducing a transferable voice feature loss without injecting noise. We then apply audio style transfer to further deceive voice decoders while preserving perceived sound quality. Finally, optimized reverberation is introduced to disrupt the output of voice generation models without affecting the naturalness of the speech. Additionally, we develop LiveMask to protect streaming speech in real time through a universal frequency filter and reverberation generator. Our experimental results show that ClearMask and LiveMask effectively prevent voice deepfake attacks from deceiving speaker verification models and human listeners, even for unseen voice synthesis models and black-box API services. Furthermore, ClearMask demonstrates resilience against adaptive attackers who attempt to recover the original audio signal from the protected speech samples.
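
To make the two signal-level ideas in the abstract concrete (removing selected frequencies rather than adding noise, then adding reverberation), here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' method: the paper filters the mel-spectrogram and optimizes the filter and reverberation, whereas this sketch uses a fixed FFT band-stop and a hand-picked impulse response. The helper names `attenuate_band` and `add_reverb`, the band edges, and the echo delays are all hypothetical, and the style-transfer stage is omitted.

```python
import numpy as np

def attenuate_band(signal, sample_rate, low_hz, high_hz, gain=0.0):
    """Scale frequency content in [low_hz, high_hz] by `gain` (hypothetical helper).

    Nothing is added to the signal; energy is only removed or reduced.
    """
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    band = (freqs >= low_hz) & (freqs <= high_hz)
    spectrum[band] *= gain
    return np.fft.irfft(spectrum, n=len(signal))

def add_reverb(signal, sample_rate, delays_s=(0.02, 0.045), decays=(0.4, 0.2)):
    """Convolve with a sparse impulse response: direct path plus a few echoes.

    The delays and decays are assumed values, not optimized ones.
    """
    taps = [int(round(d * sample_rate)) for d in delays_s]
    impulse_response = np.zeros(max(taps) + 1)
    impulse_response[0] = 1.0
    for tap, decay in zip(taps, decays):
        impulse_response[tap] += decay
    # Truncate so the output has the same length as the input.
    return np.convolve(signal, impulse_response)[: len(signal)]

# Toy "speech": two tones at 300 Hz and 2500 Hz, one second at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
speech_like = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 2500 * t)

filtered = attenuate_band(speech_like, sample_rate, 2000, 3000)
protected = add_reverb(filtered, sample_rate)
```

In this toy signal the 2500 Hz tone is removed while the 300 Hz tone is left intact, which mirrors the "feature loss without injected noise" idea only in spirit; which frequencies to remove is exactly what the paper's method has to decide.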


Key Contributions

  • ClearMask: noise-free audio protection via mel-spectrogram frequency filtering, audio style transfer, and optimized reverberation that disrupts voice synthesis models without degrading perceived audio quality
  • LiveMask: real-time streaming speech protection using a universal frequency filter and reverberation generator for virtual meetings and voice messages
  • Transferable defense that generalizes to unseen voice synthesis models and black-box API services, with demonstrated resilience against adaptive attackers attempting to recover the original signal
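
The streaming setting described for LiveMask implies that a fixed ("universal") filter and reverberation must be applied chunk by chunk as audio arrives. The sketch below shows one common way to do that: a linear filter applied to successive chunks while carrying the convolution tail across chunk boundaries. It is an assumption about how such processing could be structured, not the paper's implementation; the class name `StreamingFIR`, the kernel values, and the chunk size are hypothetical.

```python
import numpy as np

class StreamingFIR:
    """Apply a fixed FIR kernel to audio chunks, keeping state between chunks."""

    def __init__(self, kernel):
        self.kernel = np.asarray(kernel, dtype=float)
        # Samples that "spill over" past the end of the previous chunk.
        self.tail = np.zeros(len(self.kernel) - 1)

    def process(self, chunk):
        chunk = np.asarray(chunk, dtype=float)
        full = np.convolve(chunk, self.kernel)   # length: len(chunk) + K - 1
        full[: len(self.tail)] += self.tail      # add carry-over from earlier chunks
        self.tail = full[len(chunk):].copy()     # new carry-over, length K - 1
        return full[: len(chunk)]

# Hypothetical kernel: direct path plus one echo 40 samples later.
kernel = np.zeros(41)
kernel[0] = 1.0
kernel[40] = 0.3

rng = np.random.default_rng(0)
signal = rng.standard_normal(1000)   # stand-in for incoming audio samples

stream = StreamingFIR(kernel)
chunks = [signal[i:i + 160] for i in range(0, len(signal), 160)]
streamed = np.concatenate([stream.process(c) for c in chunks])

# Reference: the same filter applied offline to the whole signal.
offline = np.convolve(signal, kernel)[: len(signal)]
```

Carrying the tail is what makes the chunked output match the offline result, so protection applied live does not introduce discontinuities at chunk boundaries.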

🛡️ Threat Analysis

Input Manipulation Attack

ClearMask crafts adversarial perturbations (frequency filtering, audio style transfer, reverberation) on input audio that cause voice synthesis models to produce incorrect outputs at inference time; the core mechanism is input manipulation targeting ML voice encoder/decoder pipelines in a transferable, black-box setting.

Output Integrity Attack

The paper's goal is protecting audio content integrity against AI-generated deepfake voice synthesis — it prevents voice cloning models from successfully impersonating a speaker, directly addressing the authenticity and provenance of audio content; anti-deepfake perturbation for audio falls under ML09's content integrity domain.
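
Whether a cloned voice "deceives" a speaker verification model is commonly decided by comparing speaker embeddings against a threshold. The sketch below illustrates that decision rule with cosine similarity. It is a generic illustration, not the evaluation protocol from the paper: the embeddings are made-up three-dimensional vectors, and the function names and the 0.7 threshold are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def accepts(enrolled, candidate, threshold=0.7):
    """Return True if the candidate is accepted as the enrolled speaker.

    The threshold is an assumed value; real systems tune it on held-out data.
    """
    return cosine_similarity(enrolled, candidate) >= threshold

# Toy embeddings (assumed values, for illustration only).
enrolled = np.array([0.9, 0.1, 0.4])
clone_from_raw = np.array([0.85, 0.15, 0.45])        # clone built from unprotected speech
clone_from_protected = np.array([0.1, 0.9, -0.3])    # clone built from protected speech

attack_succeeds_raw = accepts(enrolled, clone_from_raw)
attack_succeeds_protected = accepts(enrolled, clone_from_protected)
```

Under this rule, a defense is working when clones synthesized from protected speech fall below the acceptance threshold while the protected speech itself still sounds natural to listeners.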


Details

Domains
audio
Model Types
generative
Threat Tags
black_box, inference_time, digital
Applications
voice synthesis, voice deepfake protection, speaker verification