On the Information-Theoretic Fragility of Robust Watermarking under Diffusion Editing
Yunyi Ni, Ziyu Yang, Ze Niu, Emily Davis, Finn Carter
Published on arXiv
2511.10933
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Achieves near-zero watermark recovery rates against state-of-the-art deep learning watermarking schemes (StegaStamp, TrustMark, VINE) using guided diffusion editing while maintaining high visual fidelity of regenerated images
Guided Diffusion Attack
Novel technique introduced
Robust invisible watermarking embeds hidden information in images such that the watermark can survive various manipulations. However, the emergence of powerful diffusion-based image generation and editing techniques poses a new threat to these watermarking schemes. In this paper, we investigate the intersection of diffusion-based image editing and robust image watermarking. We analyze how diffusion-driven image edits can significantly degrade or even fully remove embedded watermarks from state-of-the-art robust watermarking systems, and we provide both theoretical analysis and empirical experiments. We prove that as an image undergoes iterative diffusion transformations, the mutual information between the watermarked image and the embedded payload approaches zero, causing watermark decoding to fail. We further propose a guided diffusion attack algorithm that explicitly targets and erases watermark signals during generation. We evaluate our approach on recent deep learning-based watermarking schemes and demonstrate near-zero watermark recovery rates after attack, while maintaining high visual fidelity of the regenerated images. Finally, we discuss the ethical implications of such watermark removal capabilities and provide design guidelines to make future watermarking strategies more resilient in the era of generative AI.
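The mutual-information claim can be illustrated with the standard data-processing argument. The sketch below is not the paper's proof, which this summary does not reproduce. It assumes the edit begins with the usual DDPM forward noising step and that the regenerated image depends on the payload only through the noised image. The symbols $m$ (payload), $x_w$ (watermarked image), $x_t$ (noised image), $\hat{x}$ (regenerated image), $\bar{\alpha}_t$ (cumulative noise schedule), $d$ (image dimension), and $P$ (per-dimension power bound) are notation chosen here for illustration.

```latex
% Markov chain: payload -> watermarked image -> noised image -> regenerated image
m \;\to\; x_w \;\to\; x_t \;\to\; \hat{x}

% Forward noising step (standard DDPM form, assumed here)
x_t = \sqrt{\bar{\alpha}_t}\, x_w + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

% Data-processing inequality along the chain
I(m; \hat{x}) \;\le\; I(m; x_t) \;\le\; I(m; x_w)

% Gaussian-channel bound, assuming E\|x_w\|^2 \le dP
I(m; x_t) \;\le\; I(x_w; x_t)
\;\le\; \frac{d}{2}\log\!\left(1 + \frac{\bar{\alpha}_t P}{1 - \bar{\alpha}_t}\right)
\;\xrightarrow[\;\bar{\alpha}_t \to 0\;]{}\; 0
```

Under these assumptions, the information about the payload that any decoder can extract from the regenerated image is capped by the bound on $I(m; x_t)$, which shrinks as more noise is injected.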
Key Contributions
- Information-theoretic proof that iterative diffusion transformations drive the mutual information between the watermarked image and the embedded payload to zero
- Guided diffusion attack algorithm that explicitly targets and erases watermark signals during the diffusion generation process
- Empirical evaluation on StegaStamp, TrustMark, and VINE demonstrating near-zero watermark recovery rates while preserving high visual fidelity
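The effect described in the first contribution can be seen in a toy setting. The following is a minimal sketch, not the paper's attack or any of the evaluated schemes: it embeds bits with a simple spread-spectrum watermark (a stand-in chosen here), applies only the forward noising step of a diffusion process, and measures how often the bits are decoded correctly. All function names and parameters (`embed`, `decode`, `forward_noise`, `bit_accuracy_curve`, `chips`, `strength`) are hypothetical, and the "image" is a flat list of Gaussian samples.

```python
import math
import random


def embed(host, bits, carrier, strength, chips):
    """Add +/- strength * carrier to each block of `chips` samples, one block per bit."""
    out = list(host)
    for i, b in enumerate(bits):
        sign = 1.0 if b else -1.0
        for j in range(i * chips, (i + 1) * chips):
            out[j] += strength * sign * carrier[j]
    return out


def decode(signal, carrier, n_bits, chips):
    """Recover each bit from the sign of its block's correlation with the carrier."""
    bits = []
    for i in range(n_bits):
        block = range(i * chips, (i + 1) * chips)
        corr = sum(signal[j] * carrier[j] for j in block)
        bits.append(1 if corr > 0 else 0)
    return bits


def forward_noise(x, alpha_bar, rng):
    """DDPM-style forward step: sqrt(alpha_bar) * x + sqrt(1 - alpha_bar) * noise."""
    a = math.sqrt(alpha_bar)
    s = math.sqrt(1.0 - alpha_bar)
    return [a * v + s * rng.gauss(0.0, 1.0) for v in x]


def bit_accuracy_curve(alpha_bars, n_bits=256, chips=32, strength=1.0, seed=0):
    """Return {alpha_bar: fraction of payload bits decoded correctly after noising}."""
    rng = random.Random(seed)
    n = n_bits * chips
    host = [rng.gauss(0.0, 1.0) for _ in range(n)]         # toy "image"
    carrier = [rng.choice((-1.0, 1.0)) for _ in range(n)]  # secret key
    bits = [rng.randint(0, 1) for _ in range(n_bits)]      # payload
    marked = embed(host, bits, carrier, strength, chips)
    curve = {}
    for ab in alpha_bars:
        noised = forward_noise(marked, ab, rng)
        decoded = decode(noised, carrier, n_bits, chips)
        curve[ab] = sum(d == b for d, b in zip(decoded, bits)) / n_bits
    return curve


if __name__ == "__main__":
    for ab, acc in bit_accuracy_curve([1.0, 0.5, 0.1, 0.01, 0.0]).items():
        print(f"alpha_bar={ab:<5} bit accuracy={acc:.3f}")
```

With `alpha_bar` near 1 the payload should decode almost perfectly, and as `alpha_bar` approaches 0 the accuracy should drift toward the 0.5 expected from random guessing. A real diffusion edit also runs a learned reverse process, which this sketch omits; by the data-processing inequality that step cannot restore payload information already destroyed by the noise.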
🛡️ Threat Analysis
The paper proposes and analyzes attacks that remove or defeat content watermarks (StegaStamp, TrustMark, VINE) embedded in images for provenance tracking. Watermark removal attacks on output integrity schemes fall explicitly under ML09. The guided diffusion attack erases embedded payload signals, and the paper provides an information-theoretic proof that the mutual information between the watermarked image and the payload approaches zero under iterative diffusion.