
Diffusion-Based Image Editing: An Unforeseen Adversary to Robust Invisible Watermarks

Wenkai Fu, Finn Carter, Yue Wang, Emily Davis, Bo Zhang

1 citation · 128 references · arXiv


Published on arXiv · 2511.05598

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Diffusion-based editing reduces watermark decoding accuracy to near-zero levels across StegaStamp, TrustMark, and VINE while preserving high visual fidelity, revealing a fundamental vulnerability in robust watermarking against generative AI edits.

Guided Diffusion Attack

Novel technique introduced


Abstract

Robust invisible watermarking aims to embed hidden messages into images such that they survive various manipulations while remaining imperceptible. However, powerful diffusion-based image generation and editing models now enable realistic content-preserving transformations that can inadvertently remove or distort embedded watermarks. In this paper, we present a theoretical and empirical analysis demonstrating that diffusion-based image editing can effectively break state-of-the-art robust watermarks designed to withstand conventional distortions. We analyze how the iterative noising and denoising process of diffusion models degrades embedded watermark signals, and provide formal proofs that under certain conditions a diffusion model's regenerated image retains virtually no detectable watermark information. Building on this insight, we propose a diffusion-driven attack that uses generative image regeneration to erase watermarks from a given image. Furthermore, we introduce an enhanced "guided diffusion" attack that explicitly targets the watermark during generation by integrating the watermark decoder into the sampling loop. We evaluate our approaches on multiple recent deep learning watermarking schemes (e.g., StegaStamp, TrustMark, and VINE) and demonstrate that diffusion-based editing can reduce watermark decoding accuracy to near-zero levels while preserving high visual fidelity of the images. Our findings reveal a fundamental vulnerability in current robust watermarking techniques against generative model-based edits, underscoring the need for new watermarking strategies in the era of generative AI.
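The abstract's "virtually no detectable watermark information" claim can be stated in one line. Under the idealized assumption that the regenerated image x' depends on the watermarked input only through its watermark-free semantic content c, the variables form a Markov chain and the data-processing inequality bounds what any decoder can recover (the notation here is ours, a sketch of the argument rather than the paper's formal statement):

```latex
m \;\to\; c \;\to\; x' \quad\Longrightarrow\quad I(m; x') \,\le\, I(m; c) = 0,
```

so the posterior over the embedded message m given x' equals the prior, and per-bit decoding accuracy falls to chance (1/2 for uniform bits), which is exactly the near-zero decoding regime reported in the experiments.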


Key Contributions

  • Formal theoretical proof that ideal diffusion regeneration eliminates mutual information between watermarked image and embedded message, reducing decoding to random chance
  • Unguided diffusion regeneration attack that erases watermarks by passing images through a pretrained diffusion model's noise-denoise cycle
  • Guided diffusion attack that integrates the watermark decoder as an adversarial guide during generation to actively maximize watermark signal erasure while preserving visual fidelity
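A minimal numerical sketch of why the noise-denoise cycle in the second contribution is so destructive: the DDPM forward process rescales the embedded pattern by sqrt(abar_t) while injecting unit-variance noise, so a correlation detector's response collapses at deep timesteps. The additive spread-spectrum watermark and matched-filter detector below are toy stand-ins of our own, not StegaStamp, TrustMark, or VINE:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy additive spread-spectrum watermark (stand-in for a learned embedder).
D = 100_000                        # flattened image dimension
x_clean = rng.normal(0.0, 1.0, D)
w = rng.choice([-1.0, 1.0], D)     # +/-1 watermark pattern
x0 = x_clean + 0.05 * w            # watermarked image, embedding strength 0.05

def noise_to_t(x, abar_t):
    """DDPM forward process: x_t = sqrt(abar_t)*x + sqrt(1-abar_t)*eps."""
    eps = rng.normal(0.0, 1.0, x.shape)
    return np.sqrt(abar_t) * x + np.sqrt(1.0 - abar_t) * eps

def detector_score(x):
    """Matched-filter detector: normalized correlation with the pattern."""
    return float(x @ w) / (np.linalg.norm(x) * np.linalg.norm(w))

score_clean = detector_score(x0)                    # ~0.05: clearly detectable
score_deep = detector_score(noise_to_t(x0, 0.05))   # deep noising: near chance
print(score_clean, score_deep)
```

An idealized denoiser then reconstructs a plausible image from the noised sample without any access to w, so the residual correlation after noising is an upper bound on what can survive regeneration.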

🛡️ Threat Analysis

Output Integrity Attack

The paper directly attacks content watermarks (StegaStamp, TrustMark, VINE) embedded in images for copyright protection and content authentication — watermark removal is the canonical ML09 threat. Both the unguided regeneration attack and the guided diffusion attack that integrates the watermark decoder into the sampling loop are attacks on output integrity schemes.
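The guided variant described above can be read as decoder-in-the-loop guidance: at each step, perturb the current sample so the decoder's evidence moves toward chance while the change stays small. The sketch below strips out the diffusion sampler entirely and uses a hypothetical linear correlation decoder so the gradient is analytic; all names, thresholds, and step sizes are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

D = 100_000
x_clean = rng.normal(0.0, 1.0, D)
w = rng.choice([-1.0, 1.0], D)     # watermark pattern known to the decoder
x0 = x_clean + 0.05 * w            # watermarked image

def decoder_score(x):
    """Hypothetical linear decoder: mean correlation with the pattern."""
    return float(x @ w) / D

# Decoder-guided erasure: descend the decoder's evidence |score| while
# keeping the perturbation small (stand-in for guidance inside the sampler).
x = x0.copy()
step = 0.002                       # per-step guidance strength (tuning assumption)
for _ in range(200):
    if abs(decoder_score(x)) < 0.005:    # decoder at/near chance: stop early
        break
    # gradient of decoder_score w.r.t. x is w/D; step against the current sign
    x = x - step * np.sign(decoder_score(x)) * w

distortion = np.linalg.norm(x - x0) / np.linalg.norm(x0)
print(decoder_score(x0), decoder_score(x), distortion)
```

Per the abstract, the full attack runs this kind of guidance inside the diffusion sampling loop, so the denoiser repairs the perturbation at every step and visual fidelity is maintained by the generative prior rather than by a simple norm bound as in this sketch.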


Details

Domains
vision · generative
Model Types
diffusion · cnn
Threat Tags
black_box · inference_time · targeted
Watermarking Schemes Evaluated
StegaStamp · TrustMark · VINE
Applications
image watermarking · copyright protection · content authentication