When Denoising Becomes Unsigning: Theoretical and Empirical Analysis of Watermark Fragility Under Diffusion-Based Image Editing
Fai Gu, Qiyu Tang, Te Wen, Emily Davis, Finn Carter
Published on arXiv
2603.04696
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Watermark payloads behave as low-energy, high-frequency signals that are systematically attenuated by diffusion forward noise injection and then suppressed as nuisance variation by the reverse denoising process, causing decoding error to approach random guessing under moderate editing strengths.
Robust invisible watermarking systems aim to embed imperceptible payloads that remain decodable after common post-processing such as JPEG compression, cropping, and additive noise. In parallel, diffusion-based image editing has rapidly matured into a default transformation layer for modern content pipelines, enabling instruction-based editing, object insertion and composition, and interactive geometric manipulation. This paper studies a subtle but increasingly consequential interaction between these trends: diffusion-based editing procedures may unintentionally compromise, and in extreme cases practically bypass, robust watermarking mechanisms that were explicitly engineered to survive conventional distortions. We develop a unified view of diffusion editors that (i) inject substantial Gaussian noise in a latent space and (ii) project back to the natural image manifold via learned denoising dynamics. Under this view, watermark payloads behave as low-energy, high-frequency signals that are systematically attenuated by the forward diffusion step and then treated as nuisance variation by the reverse generative process. We formalize this degradation using information-theoretic tools, proving that for broad classes of pixel-level watermark encoders/decoders the mutual information between the watermark payload and the edited output decays toward zero as the editing strength increases, yielding decoding error close to random guessing. We complement the theory with a realistic hypothetical experimental protocol and tables spanning representative watermarking methods and representative diffusion editors. Finally, we discuss ethical implications, responsible disclosure norms, and concrete design guidelines for watermarking schemes that remain meaningful in the era of generative transformations.
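The forward-noising half of this picture can be sketched numerically. The toy below is an illustrative assumption, not the paper's protocol: it embeds a low-amplitude spread-spectrum payload in a 1-D host signal, applies DDPM-style forward diffusion at increasing timesteps, and measures the bit error rate of a correlation decoder that is even told the clean host and noise level. The payload still washes out as the timestep grows, before any learned denoiser is involved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": a smooth low-frequency host signal (1-D for simplicity).
n = 4096
x = np.cumsum(rng.normal(size=n))
x /= np.abs(x).max()

# Hypothetical spread-spectrum watermark: 64 payload bits, each spread
# over a random +/-1 carrier and embedded at low amplitude, so the mark
# is a low-energy, high-frequency perturbation relative to the host.
bits = rng.integers(0, 2, size=64)
carriers = rng.choice([-1.0, 1.0], size=(64, n))
amp = 0.05
wm = amp * ((2 * bits - 1) @ carriers) / np.sqrt(64)
x_wm = x + wm

def forward_diffuse(img, t, T=1000):
    """DDPM-style forward noising with a linear beta schedule (an assumption)."""
    betas = np.linspace(1e-4, 0.02, T)
    abar = np.cumprod(1.0 - betas)[t]
    noisy = np.sqrt(abar) * img + np.sqrt(1.0 - abar) * rng.normal(size=img.shape)
    return noisy, abar

def decode(residual):
    """Correlation decoder: sign of the projection onto each carrier."""
    return (carriers @ residual > 0).astype(int)

ber = {}
for t in [0, 100, 300, 600, 900]:
    y, abar = forward_diffuse(x_wm, t)
    # Informed decoder: subtract the (scaled) clean host before correlating.
    residual = y - np.sqrt(abar) * x
    ber[t] = float(np.mean(decode(residual) != bits))
    print(f"t={t:4d}  bit error rate = {ber[t]:.2f}")
```

Even under these generous decoding assumptions, the bit error rate climbs from near zero toward the 0.5 of random guessing at large timesteps; the paper's analysis adds that the reverse denoising pass then suppresses whatever payload residue survives the forward step.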
Key Contributions
- Information-theoretic formalization proving that mutual information between watermark payload and diffusion-edited output decays toward zero as editing strength increases, yielding decoding error approaching random guessing
- Unified analysis of diffusion editors (TF-ICON, SHINE, DragFlow) as stochastic operators combining latent noise injection and manifold reprojection, identifying which steps are most damaging to watermark signals
- Hypothetical experimental protocol evaluating representative watermarks (StegaStamp, TrustMark, VINE) against representative diffusion editors, plus concrete design guidelines for watermarking in the generative editing era
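The link asserted above between vanishing mutual information and near-random decoding can be made concrete with a standard channel-coding identity (generic textbook math, not a reproduction of the paper's proof). Model each payload bit $B$ as passing through a binary symmetric channel whose crossover probability $p(\sigma)$ grows with editing strength $\sigma$:

```latex
I\bigl(B;\hat{B}\bigr) \;=\; 1 - H_2\!\bigl(p(\sigma)\bigr),
\qquad
H_2(p) \;=\; -p\log_2 p \;-\; (1-p)\log_2(1-p).
```

As $p(\sigma) \to \tfrac{1}{2}$, the binary entropy $H_2$ approaches 1 bit and $I(B;\hat{B}) \to 0$; conversely, by Fano's inequality, vanishing mutual information forces the decoding error toward that of random guessing, which is the qualitative shape of the paper's main result.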
🛡️ Threat Analysis
The paper studies how diffusion-based editing (TF-ICON, SHINE, DragFlow) defeats pixel-level content watermarks (StegaStamp, TrustMark, VINE) that protect image provenance; this is a watermark-removal / content-integrity attack. The classification guidelines explicitly state that 'attacks that REMOVE or DEFEAT image protections via denoising or purification' map to ML09, even when the removal is an unintentional side effect of editing. The information-theoretic formalization proves that the mutual information between the watermark payload and the edited output decays to zero, yielding near-random decoding error.