attack 2025

On the Information-Theoretic Fragility of Robust Watermarking under Diffusion Editing

Yunyi Ni , Ziyu Yang , Ze Niu , Emily Davis , Finn Carter

1 citation · arXiv


Published on arXiv

2511.10933

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Achieves near-zero watermark recovery rates against state-of-the-art deep learning watermarking schemes (StegaStamp, TrustMark, VINE) using guided diffusion editing while maintaining high visual fidelity of regenerated images

Guided Diffusion Attack

Novel technique introduced


Robust invisible watermarking embeds hidden information in images such that the watermark can survive various manipulations. However, the emergence of powerful diffusion-based image generation and editing techniques poses a new threat to these watermarking schemes. In this paper, we investigate the intersection of diffusion-based image editing and robust image watermarking. We analyze how diffusion-driven image edits can significantly degrade or even fully remove embedded watermarks from state-of-the-art robust watermarking systems. Both theoretical formulations and empirical experiments are provided. We prove that as an image undergoes iterative diffusion transformations, the mutual information between the watermarked image and the embedded payload approaches zero, causing watermark decoding to fail. We further propose a guided diffusion attack algorithm that explicitly targets and erases watermark signals during generation. We evaluate our approach on recent deep learning-based watermarking schemes and demonstrate near-zero watermark recovery rates after attack, while maintaining high visual fidelity of the regenerated images. Finally, we discuss ethical implications of such watermark removal capabilities and provide design guidelines for future watermarking strategies to be more resilient in the era of generative AI.
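The mutual-information claim in the abstract can be illustrated with a toy model. This is not the paper's proof: it assumes that each diffusion step acts on a single payload bit like an independent binary symmetric channel with flip probability p, which is a deliberate simplification. The function names and the value p = 0.05 are hypothetical and chosen only for illustration.

```python
import math


def binary_entropy(q: float) -> float:
    """Entropy in bits of a Bernoulli(q) random variable."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1.0 - q) * math.log2(1.0 - q)


def cascaded_flip_prob(p: float, steps: int) -> float:
    """Net flip probability after `steps` independent BSC(p) stages.

    Assumption: each stage flips the bit independently with probability p.
    """
    return 0.5 * (1.0 - (1.0 - 2.0 * p) ** steps)


def payload_mutual_information(p: float, steps: int) -> float:
    """I(bit; output) in bits for a uniform payload bit after the cascade."""
    mi = 1.0 - binary_entropy(cascaded_flip_prob(p, steps))
    return max(0.0, mi)  # guard against tiny negative values from rounding


if __name__ == "__main__":
    # Toy setting: 5% flip probability per step (hypothetical value).
    for n in (0, 1, 5, 20, 100):
        print(n, payload_mutual_information(0.05, n))
```

In this toy setting the net flip probability tends to 1/2 as the number of steps grows, so the information the output carries about the payload bit decays toward zero, which mirrors the qualitative behavior the abstract describes.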


Key Contributions

  • Information-theoretic proof that iterative diffusion transformations drive mutual information between watermarked image and embedded payload to zero
  • Guided diffusion attack algorithm that explicitly targets and erases watermark signals during the diffusion generation process
  • Empirical evaluation on StegaStamp, TrustMark, and VINE demonstrating near-zero watermark recovery rates with high visual fidelity preserved
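A "watermark recovery rate" like the one reported above can be computed along the following lines. This is a minimal sketch under stated assumptions: this summary does not give the paper's exact metric, so the definition used here (an image counts as recovered when its bit accuracy reaches a threshold) and the default threshold of 0.9 are hypothetical, as are the function names.

```python
from typing import Iterable, Sequence, Tuple


def bit_accuracy(payload: Sequence[int], decoded: Sequence[int]) -> float:
    """Fraction of payload bits that the decoder reproduced correctly."""
    if len(payload) != len(decoded):
        raise ValueError("payload and decoded bits must have equal length")
    if len(payload) == 0:
        raise ValueError("payload must not be empty")
    matches = sum(1 for a, b in zip(payload, decoded) if a == b)
    return matches / len(payload)


def recovery_rate(
    pairs: Iterable[Tuple[Sequence[int], Sequence[int]]],
    threshold: float = 0.9,  # hypothetical cutoff, not taken from the paper
) -> float:
    """Fraction of images whose decoded payload meets the accuracy threshold."""
    results = [bit_accuracy(p, d) >= threshold for p, d in pairs]
    if not results:
        raise ValueError("at least one (payload, decoded) pair is required")
    return sum(results) / len(results)


if __name__ == "__main__":
    embedded = [1, 0, 1, 1]
    after_attack = [
        ([1, 0, 1, 1]),  # decoder still recovers every bit
        ([0, 1, 1, 0]),  # decoder output no longer matches the payload
    ]
    print(recovery_rate([(embedded, d) for d in after_attack]))
```

Because random guessing yields a bit accuracy near 0.5, a high threshold separates genuinely surviving watermarks from chance agreement; a near-zero recovery rate then means almost no attacked image still decodes to its payload.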

🛡️ Threat Analysis

Output Integrity Attack

The paper proposes and analyzes attacks that remove or defeat content watermarks (StegaStamp, TrustMark, VINE) embedded in images for provenance tracking. Watermark removal attacks on output integrity schemes fall under ML09. The guided diffusion attack erases embedded payload signals, and the paper provides an information-theoretic proof that the mutual information between the watermarked image and the payload approaches zero under iterative diffusion.


Details

Domains
vision, generative
Model Types
diffusion
Threat Tags
grey_box, inference_time, digital
Applications
image watermarking, content provenance, copyright protection