Diffusion-Based Image Editing for Breaking Robust Watermarks

Robust invisible watermarking aims to embed hidden information into images such that the watermark can survive various image manipulations. However, the rise of powerful diffusion-based image generation and editing techniques poses a new threat to these watermarking schemes. In this paper, we present a theoretical study and method demonstrating that diffusion models can effectively break robust image watermarks that were designed to resist conventional perturbations. We show that a diffusion-driven "image regeneration" process can erase embedded watermarks while preserving perceptual image content. We further introduce a novel guided diffusion attack that explicitly targets the watermark signal during generation, significantly degrading watermark detectability. Theoretically, we prove that as an image undergoes sufficient diffusion-based transformation, the mutual information between the watermarked image and the embedded watermark payload vanishes, resulting in decoding failure. Experimentally, we evaluate our approach on multiple state-of-the-art watermarking schemes (including the deep learning-based methods StegaStamp, TrustMark, and VINE) and demonstrate near-zero watermark recovery rates after attack, while maintaining high visual fidelity of the regenerated images. Our findings highlight a fundamental vulnerability in current robust watermarking techniques against generative model-based attacks, underscoring the need for new watermarking strategies in the era of generative AI.
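
The vanishing mutual information claim can be motivated with a standard information-theoretic argument. The sketch below is not the paper's proof. It assumes DDPM-style forward noising with a cumulative schedule \bar\alpha_t, an image with d pixels whose expected squared norm is at most P, and a regeneration step that sees only the noised image x_t. All symbols are assumed notation rather than taken from the source.

```latex
% Assumed notation: w = payload, x_0 = watermarked image,
% x_t = noised image, \hat{x} = regenerated image, d = number of pixels.
x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I_d)

% w \to x_0 \to x_t \to \hat{x} is a Markov chain, so by the
% data processing inequality:
I(w; \hat{x}) \;\le\; I(w; x_t) \;\le\; I(x_0; x_t)

% Gaussian channel capacity bound under E\|x_0\|^2 \le P:
I(x_0; x_t) \;\le\; \frac{d}{2}\,
\log\!\left(1 + \frac{\bar\alpha_t}{1-\bar\alpha_t}\cdot\frac{P}{d}\right)
\;\xrightarrow[\;\bar\alpha_t \to 0\;]{}\; 0
```

Once I(w; \hat{x}) falls well below the payload entropy, Fano's inequality rules out reliable decoding by any decoder, which matches the decoding failure described in the abstract.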

Key Contributions

Theoretical proof that diffusion-based image transformation destroys mutual information between a watermarked image and its embedded payload, guaranteeing decoding failure under sufficient noise levels.
Unguided diffusion regeneration attack that erases robust image watermarks while preserving perceptual content fidelity.
Novel guided diffusion attack that incorporates the watermark decoder as an adversarial guide to explicitly suppress watermark signals during diffusion sampling.
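
The unguided regeneration attack listed above follows a noise-then-denoise pattern. The sketch below is a toy NumPy illustration of that pattern, not the paper's pipeline. The spread-spectrum watermark, the correlation decoder, the noise level `alpha_bar=0.9`, and the low-pass filter used as a stand-in for a pretrained diffusion denoiser are all assumptions chosen to keep the example self-contained. In this toy the low-pass projection does most of the removal, whereas a real diffusion model applies a far richer learned image prior.

```python
import numpy as np

def make_carriers(num_bits, size, rng):
    """One pseudo-random +/-1 pattern per payload bit (hypothetical secret key)."""
    return rng.choice([-1.0, 1.0], size=(num_bits, size, size))

def embed(host, bits, carriers, strength=0.015):
    """Toy spread-spectrum embedding: add +carrier for bit 1, -carrier for bit 0."""
    signs = 2.0 * bits - 1.0
    return host + strength * np.tensordot(signs, carriers, axes=1)

def decode(image, carriers):
    """Correlate the mean-removed image with each carrier and threshold at zero."""
    centered = image - image.mean()
    scores = np.tensordot(carriers, centered, axes=([1, 2], [0, 1]))
    return (scores > 0).astype(int)

def low_pass(image, cutoff):
    """Keep spatial frequencies with |fx|, |fy| <= cutoff (crude image prior)."""
    n = image.shape[0]
    freqs = np.abs(np.fft.fftfreq(n, d=1.0 / n))
    mask = (freqs[:, None] <= cutoff + 0.5) & (freqs[None, :] <= cutoff + 0.5)
    return np.fft.ifft2(np.fft.fft2(image) * mask).real

def regenerate(image, alpha_bar, cutoff, rng):
    """Forward-noise to level alpha_bar, then 'denoise' with the toy prior."""
    noise = rng.standard_normal(image.shape)
    x_t = np.sqrt(alpha_bar) * image + np.sqrt(1.0 - alpha_bar) * noise
    # Stand-in for the reverse diffusion process (assumption, not a real model).
    return low_pass(x_t, cutoff) / np.sqrt(alpha_bar)

def run_demo(seed=0, size=128, num_bits=32):
    rng = np.random.default_rng(seed)
    grid = np.arange(size) / size
    # Smooth synthetic host image with values in [0.05, 0.95].
    host = (0.5 + 0.25 * np.sin(2 * np.pi * 2 * grid)[:, None]
            + 0.20 * np.cos(2 * np.pi * 3 * grid)[None, :])
    bits = rng.integers(0, 2, size=num_bits)
    carriers = make_carriers(num_bits, size, rng)
    marked = embed(host, bits, carriers)
    attacked = regenerate(marked, alpha_bar=0.9, cutoff=6, rng=rng)
    acc_before = float(np.mean(decode(marked, carriers) == bits))
    acc_after = float(np.mean(decode(attacked, carriers) == bits))
    mse_after = float(np.mean((attacked - host) ** 2))
    return acc_before, acc_after, mse_after

if __name__ == "__main__":
    acc_before, acc_after, mse_after = run_demo()
    print(f"bit accuracy before: {acc_before:.2f}, after: {acc_after:.2f}")
    print(f"MSE of regenerated image vs. host: {mse_after:.4f}")
```

Bit accuracy near 0.5 corresponds to chance-level decoding, and the mean squared error against the host gives a rough fidelity check. A realistic evaluation would swap `regenerate` for an image-to-image pass through a pretrained diffusion model and `decode` for the actual StegaStamp, TrustMark, or VINE decoder.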

🛡️ Threat Analysis

Output Integrity Attack

The paper attacks and removes content watermarks (StegaStamp, TrustMark, VINE) embedded in images for copyright and provenance protection. Watermark removal attacks on content watermarks fall explicitly under ML09 (Output Integrity Attack). The paper's primary contributions are two novel attack methods, unguided diffusion regeneration and a guided diffusion attack, that defeat these output integrity mechanisms.
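
The guided attack differs from plain regeneration by adding a decoder-derived gradient during sampling. The sketch below isolates only that guidance update, applied to a clean-image estimate as it might be inside each sampling step. It assumes a hypothetical linear correlation decoder so that the gradient is analytic, and the loss (half the sum of squared decoder scores), the step size, and every function name are illustrative assumptions rather than the paper's formulation. A real attack would backpropagate through a neural decoder and interleave this update with a diffusion model's denoising steps.

```python
import numpy as np

def decoder_scores(image, carriers):
    """Toy linear decoder: correlation of the mean-removed image with each carrier."""
    centered = image - image.mean()
    return np.tensordot(carriers, centered, axes=([1, 2], [0, 1]))

def confidence_grad(image, carriers):
    """Gradient of L(x) = 0.5 * sum_i score_i(x)^2 with respect to the image.

    Since score_i = <carrier_i - mean(carrier_i), x>, the gradient is
    sum_i score_i * (carrier_i - mean(carrier_i)).
    """
    scores = decoder_scores(image, carriers)
    grad = np.tensordot(scores, carriers, axes=1)
    return grad - grad.mean()

def guided_update(x0_hat, carriers, guidance=1.0):
    """One guidance step that pushes decoder scores toward zero.

    The attacker needs decoder access but not the payload bits.
    Scaling by 1 / num_pixels makes guidance=1.0 roughly cancel each score
    for near-orthogonal +/-1 carriers (an assumption of this toy decoder).
    """
    step = guidance / x0_hat.size
    return x0_hat - step * confidence_grad(x0_hat, carriers)

def run_guided_demo(seed=0, size=128, num_bits=32, strength=0.015):
    rng = np.random.default_rng(seed)
    grid = np.arange(size) / size
    # Smooth synthetic host image and a toy spread-spectrum watermark.
    host = (0.5 + 0.25 * np.sin(2 * np.pi * 2 * grid)[:, None]
            + 0.20 * np.cos(2 * np.pi * 3 * grid)[None, :])
    bits = rng.integers(0, 2, size=num_bits)
    carriers = rng.choice([-1.0, 1.0], size=(num_bits, size, size))
    marked = host + strength * np.tensordot(2.0 * bits - 1.0, carriers, axes=1)

    guided = guided_update(marked, carriers)

    def accuracy(img):
        return float(np.mean((decoder_scores(img, carriers) > 0) == (bits == 1)))

    def confidence(img):
        return float(np.mean(np.abs(decoder_scores(img, carriers))))

    mse = float(np.mean((guided - marked) ** 2))
    return (accuracy(marked), accuracy(guided),
            confidence(marked), confidence(guided), mse)

if __name__ == "__main__":
    acc_b, acc_a, conf_b, conf_a, mse = run_guided_demo()
    print(f"bit accuracy before: {acc_b:.2f}, after: {acc_a:.2f}")
    print(f"mean |score| before: {conf_b:.1f}, after: {conf_a:.1f}")
    print(f"MSE between guided and watermarked image: {mse:.4f}")
```

For this linear decoder the update amounts to projecting the image away from the span of the carriers, so the scores collapse to cross-talk level while the pixel change stays on the order of the watermark's own energy. With a nonlinear neural decoder no such closed form exists, which is one reason to apply small guidance steps repeatedly across the sampling trajectory.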

Details

Domains

vision, generative

Model Types

diffusion

Threat Tags

black_box, grey_box, inference_time, digital

Datasets

StegaStamp, TrustMark, VINE, Hidden

Applications

2025, 0 citations
