attack 2026

Vanishing Watermarks: Diffusion-Based Image Editing Undermines Robust Invisible Watermarking

Fan Guo, Jiyu Kang, Qi Ming, Emily Davis, Finn Carter

0 citations · 103 references · arXiv (Cornell University)

Published on arXiv: 2602.20680

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Diffusion-based regeneration and guided attacks reduce watermark recovery to near-zero across StegaStamp, TrustMark, and VINE while preserving high visual fidelity of edited images.

Guided Diffusion Watermark Removal

Novel technique introduced


Robust invisible watermarking schemes aim to embed hidden information into images such that the watermark survives common manipulations. However, powerful diffusion-based image generation and editing techniques now pose a new threat to these watermarks. In this paper, we present a comprehensive theoretical and empirical analysis demonstrating that diffusion models can effectively erase robust watermarks even when those watermarks were designed to withstand conventional distortions. We show that a diffusion-driven image regeneration process, which leverages generative models to recreate an image, can remove embedded watermarks while preserving the image's perceptual content. Furthermore, we introduce a guided diffusion-based attack that explicitly targets the embedded watermark signal during generation, significantly degrading watermark detectability. Theoretically, we prove that as an image undergoes sufficient diffusion transformations, the mutual information between the watermarked image and the hidden payload approaches zero, leading to inevitable decoding failure. Experimentally, we evaluate multiple state-of-the-art watermarking methods (including deep learning-based schemes like StegaStamp, TrustMark, and VINE) and demonstrate that diffusion edits yield near-zero watermark recovery rates after attack, while maintaining high visual fidelity of the regenerated images. Our findings reveal a fundamental vulnerability in current robust watermarking techniques against generative model-based edits, underscoring the need for new strategies to ensure watermark resilience in the era of powerful diffusion models.
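The mutual-information claim can be made plausible with a standard argument, sketched here under stated assumptions; it is not necessarily the proof the paper gives. The notation is assumed for illustration and does not come from the abstract: a variance-preserving forward process with noise level \bar{\alpha}_t, noise \varepsilon independent of the image, image dimension d, an average per-pixel power bound P, payload m, watermarked image x_0, and attacker output \hat{x}. If \hat{x} is computed from the noised image x_t alone, the four variables form a Markov chain, and the data-processing inequality together with the Gaussian channel capacity bound gives:

```latex
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\varepsilon,
\qquad \varepsilon \sim \mathcal{N}(0, I_d),
\qquad \mathbb{E}\,\|x_0\|^2 \le dP

m \;\to\; x_0 \;\to\; x_t \;\to\; \hat{x}

I(m; \hat{x}) \;\le\; I(m; x_t) \;\le\; I(x_0; x_t)
\;\le\; \frac{d}{2}\,\log\!\left(1 + \frac{\bar{\alpha}_t\, P}{1-\bar{\alpha}_t}\right)
\;\xrightarrow[\;\bar{\alpha}_t \to 0\;]{}\; 0
```

The bound shrinks as more noise is injected. A practical regeneration attack stops at an intermediate noise level to keep the content recognisable, so this sketch indicates the trend only and says nothing about the specific recovery rates reported.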


Key Contributions

  • Empirical demonstration that leading robust watermarking schemes (StegaStamp, TrustMark, VINE) are reduced to near-zero recovery rates by diffusion-based image editing
  • Novel guided diffusion watermark removal algorithm that integrates watermark decoder feedback into the diffusion denoising loop to explicitly erase hidden payloads
  • Theoretical proof that mutual information between a watermarked image and its hidden payload approaches zero as diffusion transformations increase, establishing a fundamental vulnerability
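The decoder-feedback idea in the second contribution can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not the paper's algorithm: the watermark is a hypothetical linear scheme whose bits are the signs of orthonormal projections, `denoise_step` is a hypothetical callable standing in for one reverse-diffusion update of a pretrained model, and the guidance term is a plain gradient step on a squared-error loss that pushes the decoder toward a scrambled message. The demo passes an identity function as the denoiser, so it exercises only the guidance term.

```python
import numpy as np

def make_toy_decoder(dim, n_bits, rng):
    """Toy linear decoder (assumption): rows of W are orthonormal directions."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, n_bits)))
    return q.T  # shape (n_bits, dim)

def embed(cover, bits, W, strength=1.0):
    """Overwrite the projections of `cover` onto W with strength * bits (bits are +/-1)."""
    return cover + W.T @ (strength * bits - W @ cover)

def decode(image, W):
    """Read the payload as the signs of the projections."""
    return np.sign(W @ image)

def guided_removal(x_wm, W, denoise_step, steps=50, step_size=0.5, seed=1):
    """Interleave a (stand-in) denoising update with a decoder-gradient step."""
    rng = np.random.default_rng(seed)
    logits = W @ x_wm
    target = np.sign(logits)              # message the decoder currently reads
    flip = rng.permutation(target.size)[: target.size // 2]
    target[flip] *= -1                    # scramble exactly half of those bits
    margin = np.abs(logits).mean()        # keep the same decoder confidence
    x = x_wm.copy()
    for t in range(steps, 0, -1):
        x = denoise_step(x, t)            # hypothetical reverse-diffusion update
        # Gradient of 0.5 * ||W x - margin * target||^2 with respect to x.
        grad = W.T @ (W @ x - margin * target)
        x = x - step_size * grad
    return x

rng = np.random.default_rng(0)
W = make_toy_decoder(dim=4096, n_bits=16, rng=rng)
cover = rng.standard_normal(4096)         # stand-in for an image
bits = rng.choice([-1.0, 1.0], size=16)
x_wm = embed(cover, bits, W)
x_attacked = guided_removal(x_wm, W, denoise_step=lambda x, t: x)

acc_before = np.mean(decode(x_wm, W) == bits)
acc_after = np.mean(decode(x_attacked, W) == bits)
rel_change = np.linalg.norm(x_attacked - x_wm) / np.linalg.norm(x_wm)
print(acc_before, acc_after, rel_change)
```

In this toy, bit accuracy drops from 1.0 to 0.5, which is chance level by construction because half of the decoded bits are flipped, while the attacked vector differs from the watermarked one by less than 10% in norm. A real attack would replace the identity denoiser with a diffusion model and the analytic gradient with backpropagation through a neural decoder, and its fidelity would depend on that model rather than on anything shown here.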

🛡️ Threat Analysis

Output Integrity Attack

The paper attacks content watermarks (StegaStamp, TrustMark, VINE) embedded in image outputs for provenance and copyright tracking. This is a watermark removal attack on output integrity and content authenticity, not on model ownership. Both the guided diffusion attack and the image regeneration process target the content watermarking pipeline directly.


Details

Domains
vision, generative
Model Types
diffusion, cnn
Threat Tags
grey_box, black_box, inference_time, digital
Datasets
StegaStamp benchmark, TrustMark benchmark, VINE benchmark
Applications
image watermarking, content provenance, copyright protection, content authenticity