Vanishing Watermarks: Diffusion-Based Image Editing Undermines Robust Invisible Watermarking

Robust invisible watermarking schemes aim to embed hidden information into images such that the watermark survives common manipulations. However, powerful diffusion-based image generation and editing techniques now pose a new threat to these watermarks. In this paper, we present a comprehensive theoretical and empirical analysis demonstrating that diffusion models can effectively erase robust watermarks even when those watermarks were designed to withstand conventional distortions. We show that a diffusion-driven image regeneration process, which leverages generative models to recreate an image, can remove embedded watermarks while preserving the image's perceptual content. Furthermore, we introduce a guided diffusion-based attack that explicitly targets the embedded watermark signal during generation, significantly degrading watermark detectability. Theoretically, we prove that as an image undergoes sufficient diffusion transformations, the mutual information between the watermarked image and the hidden payload approaches zero, leading to inevitable decoding failure. Experimentally, we evaluate multiple state-of-the-art watermarking methods (including deep learning-based schemes like StegaStamp, TrustMark, and VINE) and demonstrate that diffusion edits yield near-zero watermark recovery rates after attack, while maintaining high visual fidelity of the regenerated images. Our findings reveal a fundamental vulnerability in current robust watermarking techniques against generative model-based edits, underscoring the need for new strategies to ensure watermark resilience in the era of powerful diffusion models.
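The mutual-information claim can be illustrated with a standard Gaussian-channel bound. The sketch below is not the paper's proof. It assumes (none of this is stated in the abstract) an additive watermark \delta_m on a cover image c, a payload m drawn independently of c, and the usual Gaussian forward process with cumulative schedule \bar\alpha_t; \hat{x} denotes the regenerated image produced from x_t by any reverse process that does not otherwise see m.

```latex
% Assumptions (illustrative, not from the paper): additive watermark, Gaussian forward process.
x_0 = c + \delta_m, \qquad
x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon,
\quad \varepsilon \sim \mathcal{N}(0, I)

% Given the cover c, two payloads m and m' yield Gaussians with the same covariance:
D_{\mathrm{KL}}\big(p(x_t \mid m, c)\,\big\|\,p(x_t \mid m', c)\big)
  = \frac{\bar\alpha_t\, \|\delta_m - \delta_{m'}\|^2}{2\,(1-\bar\alpha_t)}

% Convexity of KL in its second argument, then data processing along m -> x_t -> \hat{x}:
I(m; \hat{x} \mid c) \;\le\; I(m; x_t \mid c)
  \;\le\; \max_{m, m'} \frac{\bar\alpha_t\, \|\delta_m - \delta_{m'}\|^2}{2\,(1-\bar\alpha_t)}
  \;\xrightarrow[\;\bar\alpha_t \to 0\;]{}\; 0
```

Under these assumptions the bound (in nats) shrinks with the ratio \bar\alpha_t / (1-\bar\alpha_t), so a low-energy watermark is the first component to become statistically invisible as the noise level grows, while higher-energy image content remains recoverable by the generative prior.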

Key Contributions

Empirical demonstration that leading robust watermarking schemes (StegaStamp, TrustMark, VINE) are reduced to near-zero recovery rates by diffusion-based image editing
Novel guided diffusion watermark removal algorithm that integrates watermark decoder feedback into the diffusion denoising loop to explicitly erase hidden payloads
Theoretical proof that mutual information between a watermarked image and its hidden payload approaches zero as diffusion transformations increase, establishing a fundamental vulnerability
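The idea of feeding decoder output back into the denoising loop can be sketched on a toy problem. The code below is a minimal illustration, not the paper's algorithm: the spread-spectrum embedder, the linear correlation decoder, the box-blur "denoiser", and all names and parameters (make_keys, embed, decode_logits, blur_denoise, guided_regeneration, sigma, guidance) are hypothetical stand-ins. A real attack would use a pretrained diffusion model and backpropagate through a neural decoder such as those of StegaStamp, TrustMark, or VINE.

```python
import numpy as np

def make_keys(n_bits, shape, rng):
    """Hypothetical secret keys: orthonormal, zero-mean noise patterns, one per bit."""
    raw = rng.standard_normal((shape[0] * shape[1], n_bits))
    raw -= raw.mean(axis=0)          # zero mean: flat image regions do not bias decoding
    q, _ = np.linalg.qr(raw)         # orthonormal columns spanning the same subspace
    return q.T                       # shape (n_bits, n_pixels)

def embed(cover, keys, bits, strength=2.0):
    """Toy spread-spectrum embedder: add +/- strength times each bit's key pattern."""
    signs = 2.0 * np.asarray(bits) - 1.0
    return cover + strength * (signs @ keys).reshape(cover.shape)

def decode_logits(image, keys):
    """Toy linear decoder: correlate the image with each key; sign gives the bit."""
    return keys @ image.ravel()

def blur_denoise(x):
    """Stand-in for one reverse-diffusion step: blend with a 3x3 circular box blur."""
    acc = np.zeros_like(x)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            acc += np.roll(x, (dy, dx), axis=(0, 1))
    return 0.5 * x + 0.5 * acc / 9.0

def guided_regeneration(x_wm, keys, sigma=0.2, steps=20, guidance=1.0, seed=0):
    """Noise the image, then alternate denoising with a decoder-guided correction."""
    rng = np.random.default_rng(seed)
    x = x_wm + sigma * rng.standard_normal(x_wm.shape)   # forward noising
    for _ in range(steps):
        x = blur_denoise(x)                              # toy denoising step
        logits = decode_logits(x, keys)
        # Gradient of 0.5 * ||logits||^2 with respect to x (the decoder is linear here).
        grad = (keys.T @ logits).reshape(x.shape)
        x = x - guidance * grad                          # push decoder output toward zero
    return x

# Usage on a synthetic smooth "image" (assumed data, for illustration only).
rng = np.random.default_rng(0)
yy, xx = np.mgrid[0:128, 0:128]
cover = 0.5 + 0.4 * np.sin(2 * np.pi * xx / 64) * np.cos(2 * np.pi * yy / 64)
keys = make_keys(8, cover.shape, rng)
bits = rng.integers(0, 2, size=8)
x_wm = embed(cover, keys, bits)

bits_before = (decode_logits(x_wm, keys) > 0).astype(int)
attacked = guided_regeneration(x_wm, keys)
logits_after = decode_logits(attacked, keys)

print("payload recovered before attack:", np.array_equal(bits_before, bits))
print("largest |logit| after attack:", np.abs(logits_after).max())
```

Because the toy decoder is linear and the keys are orthonormal, a guidance scale of 1.0 nulls the decoder logits exactly at every step; a neural decoder would need automatic differentiation and typically smaller steps. Setting guidance=0.0 reduces the loop to plain noise-and-denoise regeneration, which corresponds to the unguided attack described in the abstract.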

🛡️ Threat Analysis

Output Integrity Attack

The paper attacks content watermarks (StegaStamp, TrustMark, VINE) embedded in image outputs for provenance and copyright tracking. This is a watermark removal attack on output integrity and content authenticity, not on model ownership. Both the guided diffusion attack and the image regeneration process directly target the content watermarking pipeline.

Details

Domains

vision, generative

Model Types

diffusion, cnn

Threat Tags

grey_box, black_box, inference_time, digital

Datasets

StegaStamp benchmark, TrustMark benchmark, VINE benchmark

Applications

Similar Papers

Diffusion-Based Image Editing for Breaking Robust Watermarks

Untraceable DeepFakes via Traceable Fingerprint Elimination

MarkSweep: A No-box Removal Attack on AI-Generated Image Watermarking via Noise Intensification and Frequency-aware Denoising

MarkCleaner: High-Fidelity Watermark Removal via Imperceptible Micro-Geometric Perturbation

On the Information-Theoretic Fragility of Robust Watermarking under Diffusion Editing

RAVEN: Erasing Invisible Watermarks via Novel View Synthesis

The Coding Limits of Robust Watermarking for Generative Models

First-Place Solution to NeurIPS 2024 Invisible Watermark Removal Challenge