Diffusion-Based Image Editing: An Unforeseen Adversary to Robust Invisible Watermarks
Wenkai Fu, Finn Carter, Yue Wang, Emily Davis, Bo Zhang
Published on arXiv (arXiv:2511.05598)
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Diffusion-based editing reduces watermark decoding accuracy to near-zero levels across StegaStamp, TrustMark, and VINE while preserving high visual fidelity, revealing a fundamental vulnerability in robust watermarking against generative AI edits.
Guided Diffusion Attack
Novel technique introduced
Robust invisible watermarking aims to embed hidden messages into images such that they survive various manipulations while remaining imperceptible. However, powerful diffusion-based image generation and editing models now enable realistic content-preserving transformations that can inadvertently remove or distort embedded watermarks. In this paper, we present a theoretical and empirical analysis demonstrating that diffusion-based image editing can effectively break state-of-the-art robust watermarks designed to withstand conventional distortions. We analyze how the iterative noising and denoising process of diffusion models degrades embedded watermark signals, and provide formal proofs that under certain conditions a diffusion model's regenerated image retains virtually no detectable watermark information. Building on this insight, we propose a diffusion-driven attack that uses generative image regeneration to erase watermarks from a given image. Furthermore, we introduce an enhanced \emph{guided diffusion} attack that explicitly targets the watermark during generation by integrating the watermark decoder into the sampling loop. We evaluate our approaches on multiple recent deep learning watermarking schemes (e.g., StegaStamp, TrustMark, and VINE) and demonstrate that diffusion-based editing can reduce watermark decoding accuracy to near-zero levels while preserving high visual fidelity of the images. Our findings reveal a fundamental vulnerability in current robust watermarking techniques against generative model-based edits, underscoring the need for new watermarking strategies in the era of generative AI.
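The unguided regeneration attack described above can be illustrated with a minimal numpy sketch. The watermark scheme, decoder, and "denoiser" below are toy stand-ins chosen for self-containment: a spread-spectrum mark replaces the learned encoders (StegaStamp, TrustMark, VINE), the DDPM forward step adds noise up to a chosen timestep, and a low-pass filter plays the role of the pretrained diffusion model's denoiser, which regenerates low-frequency content while discarding the high-frequency watermark residual.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy spread-spectrum watermark (illustrative only; the paper attacks
# learned schemes such as StegaStamp, TrustMark, and VINE).
H = W = 64
n_bits = 16
block = H * W // n_bits          # pixels carrying each message bit

# Each message bit modulates a pseudo-random +/-1 carrier pattern.
carrier = rng.choice([-1.0, 1.0], size=(n_bits, block))

def embed(image, bits, strength=0.3):
    wm = ((2.0 * bits - 1.0)[:, None] * carrier).reshape(H, W)
    return image + strength * wm

def decode(image):
    # Correlate each pixel block with its carrier; the sign recovers the bit.
    corr = np.sum(image.reshape(n_bits, block) * carrier, axis=1)
    return (corr > 0).astype(int)

def regenerate(image, abar_t=0.3, k=7):
    """Unguided regeneration attack: DDPM forward noising to timestep t,
    x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, then a stand-in
    low-pass 'denoiser' that keeps coarse content and drops the
    high-frequency watermark residual (a crude proxy for a real model)."""
    eps = rng.normal(size=image.shape)
    xt = np.sqrt(abar_t) * image + np.sqrt(1.0 - abar_t) * eps
    padded = np.pad(xt / np.sqrt(abar_t), k // 2, mode="edge")
    out = np.empty_like(image)
    for i in range(H):
        for j in range(W):
            out[i, j] = padded[i:i + k, j:j + k].mean()
    return out

base = rng.normal(size=(H, W))           # stand-in cover image
bits = rng.integers(0, 2, size=n_bits)
xw = embed(base, bits)

acc_before = np.mean(decode(xw) == bits)
acc_after = np.mean(decode(regenerate(xw)) == bits)
print(f"bit accuracy before attack: {acc_before:.2f}, after: {acc_after:.2f}")
```

Even this crude proxy shows the mechanism the paper formalizes: the forward noising injects entropy that swamps the watermark signal, and the regeneration step reconstructs only content it can model, leaving decoding near chance.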
Key Contributions
- Formal theoretical proof that ideal diffusion regeneration eliminates mutual information between watermarked image and embedded message, reducing decoding to random chance
- Unguided diffusion regeneration attack that erases watermarks by passing images through a pretrained diffusion model's noise-denoise cycle
- Guided diffusion attack that integrates the watermark decoder as an adversarial guide during generation to actively maximize watermark signal erasure while preserving visual fidelity
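The guided attack in the last contribution can be sketched in the same toy setting. Here a linear correlation "decoder" stands in for the neural watermark decoder (the paper backpropagates through the real network), and each sampling step combines mild re-noising, crudely modeling the generative update, with a gradient step that drives every decoder logit toward zero. All names and constants below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy differentiable "decoder": bit logits are correlations with known
# carrier patterns (a linear stand-in for the real decoder network).
n_bits, dim = 8, 512
carrier = rng.choice([-1.0, 1.0], size=(n_bits, dim))

def decoder_logits(x):
    return carrier @ x                      # one logit per message bit

def guidance_grad(x):
    # Gradient of the adversarial objective sum(logits**2) w.r.t. x;
    # descending it actively pushes every decoder logit toward zero.
    return 2.0 * carrier.T @ (carrier @ x)

bits = rng.integers(0, 2, size=n_bits)
signs = 2.0 * bits - 1.0
x = rng.normal(size=dim) + 0.2 * signs @ carrier   # watermarked signal

acc_before = np.mean((decoder_logits(x) > 0).astype(int) == bits)

# Guided "sampling" loop: each step mixes mild re-noising (standing in
# for the diffusion update) with the decoder-guided erasure step.
eta = 5e-4
for _ in range(50):
    x = x + 0.05 * rng.normal(size=dim) - eta * guidance_grad(x)

acc_after = np.mean((decoder_logits(x) > 0).astype(int) == bits)
print(f"bit accuracy before: {acc_before:.2f}, after guidance: {acc_after:.2f}")
```

The guidance term only perturbs the subspace the decoder is sensitive to, which is why (in the paper's full version) the attack can suppress the watermark more aggressively than blind regeneration while still preserving visual fidelity.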
🛡️ Threat Analysis
The paper directly attacks content watermarks (StegaStamp, TrustMark, VINE) embedded in images for copyright protection and content authentication — watermark removal is the canonical ML09 threat. Both the unguided regeneration attack and the guided diffusion attack that integrates the watermark decoder into the sampling loop are attacks on output integrity schemes.