defense 2025

Smark: A Watermark for Text-to-Speech Diffusion Models via Discrete Wavelet Transform

Yichuan Zhang , Chengxin Li , Yujie Gu

0 citations · 34 references · arXiv

Published on arXiv

2512.18791

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Smark outperforms existing post-processing and generative watermarking methods (WavMark, AudioSeal, GROOT) in both audio quality and watermark extraction accuracy under real-world attack scenarios including audio compression, noise addition, and resampling.

Smark

Novel technique introduced


Text-to-Speech (TTS) diffusion models generate high-quality speech, which raises challenges for protecting model intellectual property and for tracing generated speech to ensure lawful use. Audio watermarking is a promising solution. However, because TTS diffusion models differ structurally, existing watermarking methods are often designed for one specific model and degrade audio quality, which limits their practical applicability. To address this, the paper proposes Smark, a universal watermarking scheme for TTS diffusion models. Smark uses a lightweight watermark embedding framework that operates within the reverse diffusion paradigm shared by all TTS diffusion models. To limit the impact on audio quality, Smark applies the discrete wavelet transform (DWT) and embeds watermarks into the relatively stable low-frequency regions of the audio, so that the watermark integrates seamlessly with the audio and resists removal during the reverse diffusion process. Extensive experiments evaluate audio quality and watermark performance under various simulated real-world attack scenarios. The results show that Smark achieves superior performance in both audio quality and watermark extraction accuracy.
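The idea of hiding a watermark in the low-frequency DWT band can be illustrated with a minimal sketch. It is not the paper's method: Smark embeds during reverse diffusion, whereas this example works on a finished waveform. The one-level Haar transform, the quantization-based embedding rule, the `step` parameter, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def haar_dwt(x):
    """One-level Haar DWT: returns (approximation, detail) coefficients."""
    x = np.asarray(x, dtype=float)
    if len(x) % 2:
        raise ValueError("signal length must be even")
    even, odd = x[0::2], x[1::2]
    return (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: rebuilds the time-domain signal."""
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / np.sqrt(2)
    x[1::2] = (approx - detail) / np.sqrt(2)
    return x

def embed_bits(signal, bits, step=0.05):
    """Hypothetical embedding rule (an assumption, not Smark's):
    snap each low-frequency coefficient to an even multiple of step/2
    for bit 0 and to an odd multiple for bit 1."""
    approx, detail = haar_dwt(signal)
    bits = np.asarray(bits)
    if len(bits) > len(approx):
        raise ValueError("not enough low-frequency coefficients")
    offset = bits * step / 2
    head = approx[: len(bits)]
    approx = approx.copy()
    approx[: len(bits)] = np.round((head - offset) / step) * step + offset
    return haar_idwt(approx, detail)

def extract_bits(signal, n_bits, step=0.05):
    """Read bits back from the parity of the nearest multiple of step/2."""
    approx, _ = haar_dwt(signal)
    return np.round(approx[:n_bits] / (step / 2)).astype(int) % 2

# Toy carrier: one second of a 220 Hz tone at 16 kHz.
t = np.arange(16000) / 16000
audio = 0.5 * np.sin(2 * np.pi * 220 * t)
payload = np.random.default_rng(0).integers(0, 2, size=64)
marked = embed_bits(audio, payload)
recovered = extract_bits(marked, len(payload))
```

In this sketch each modified sample moves by at most `step / (2 * sqrt(2))`, so a smaller `step` is less audible but also less tolerant of noise. That trade-off is a property of the toy rule, not a result from the paper.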


Key Contributions

  • Universal watermarking framework operating within the shared reverse diffusion paradigm, achieving cross-model compatibility across all TTS diffusion architectures
  • DWT-based embedding into low-frequency sub-bands during reverse diffusion for imperceptible, robust watermarking that resists common audio processing attacks
  • Hypothesis-testing-based statistical watermark detection for reliable extraction verification
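A hypothesis-testing detector of the kind named in the last contribution can be sketched as a one-sided binomial test. The summary does not give the paper's test statistic, so this is an assumption: under the null hypothesis (no watermark) each extracted bit matches the expected bit with probability 0.5. The function names and the `alpha` threshold are hypothetical.

```python
from math import comb

def match_p_value(extracted, expected):
    """P(at least k of n bits match by chance) under Binomial(n, 0.5)."""
    if len(extracted) != len(expected):
        raise ValueError("bit sequences must have equal length")
    n = len(expected)
    k = sum(int(a == b) for a, b in zip(extracted, expected))
    # Exact upper tail of the binomial distribution with p = 0.5.
    tail = sum(comb(n, i) for i in range(k, n + 1))
    return tail / 2 ** n

def is_watermarked(extracted, expected, alpha=1e-3):
    """Reject the 'no watermark' hypothesis when the p-value is below alpha.
    alpha is an illustrative false-positive budget, not a value from the paper."""
    return match_p_value(extracted, expected) < alpha

key = [1, 0] * 16                      # hypothetical 32-bit watermark
two_errors = [1 - key[0], 1 - key[1]] + key[2:]
half_wrong = [1 - b for b in key[:16]] + key[16:]
```

With this setup, `is_watermarked(two_errors, key)` accepts a payload that lost two bits to an attack, while `is_watermarked(half_wrong, key)` rejects a sequence that matches no better than chance. A statistical decision rule therefore tolerates some bit errors while keeping the false-positive rate bounded by `alpha`.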

🛡️ Threat Analysis

Output Integrity Attack

Smark embeds watermarks into the generated audio output during the reverse diffusion process. This is provenance watermarking of model outputs, which enables source tracing and deepfake attribution. Because the watermark resides in the audio rather than in the model weights, the applicable category is ML09 (output integrity), not ML05 (model theft).


Details

Domains
audio · generative
Model Types
diffusion
Threat Tags
inference_time · digital
Datasets
LJSpeech
Applications
text-to-speech · audio copyright protection · speech source tracing · deepfake attribution