attack 2025

Diffusion-Based Image Editing for Breaking Robust Watermarks

Yunyi Ni 1, Finn Carter 2, Ze Niu 2, Emily Davis 2, Bo Zhang 2

2 citations · 20 references · arXiv

Published on arXiv: 2510.05978

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Achieves near-zero watermark recovery rates against StegaStamp, TrustMark, and VINE while maintaining high visual fidelity, outperforming conventional perturbation-based watermark removal attacks.

Guided Diffusion Attack

Novel technique introduced


Robust invisible watermarking aims to embed hidden information into images such that the watermark can survive various image manipulations. However, the rise of powerful diffusion-based image generation and editing techniques poses a new threat to these watermarking schemes. In this paper, we present a theoretical study and method demonstrating that diffusion models can effectively break robust image watermarks that were designed to resist conventional perturbations. We show that a diffusion-driven "image regeneration" process can erase embedded watermarks while preserving perceptual image content. We further introduce a novel guided diffusion attack that explicitly targets the watermark signal during generation, significantly degrading watermark detectability. Theoretically, we prove that as an image undergoes sufficient diffusion-based transformation, the mutual information between the watermarked image and the embedded watermark payload vanishes, resulting in decoding failure. Experimentally, we evaluate our approach on multiple state-of-the-art watermarking schemes (including the deep learning-based methods StegaStamp, TrustMark, and VINE) and demonstrate near-zero watermark recovery rates after attack, while maintaining high visual fidelity of the regenerated images. Our findings highlight a fundamental vulnerability in current robust watermarking techniques against generative model-based attacks, underscoring the need for new watermarking strategies in the era of generative AI.
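The mutual-information claim in the abstract can be illustrated with a standard argument. The sketch below is not the paper's proof. It assumes the usual DDPM forward process and introduces its own notation: $m$ is the payload, $x_w \in \mathbb{R}^d$ the watermarked image, $x_t$ the noised image at step $t$, $\hat{x}$ the regenerated image, $\bar{\alpha}_t$ the cumulative noise schedule, and $P$ the average per-pixel power of $x_w$. Because the regenerated image depends on the payload only through $x_t$, the data-processing inequality and the Gaussian channel capacity bound give an upper bound that shrinks to zero as the noise level grows.

```latex
% Assumed forward process (standard DDPM notation, not taken from the paper):
x_t = \sqrt{\bar{\alpha}_t}\, x_w + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I_d)

% Markov chain: m \to x_w \to x_t \to \hat{x}.
% Data processing, then the Gaussian channel bound with
% P = \mathbb{E}\lVert x_w \rVert^2 / d:
I(m; \hat{x}) \;\le\; I(m; x_t) \;\le\; I(x_w; x_t)
\;\le\; \frac{d}{2} \log\!\left( 1 + \frac{\bar{\alpha}_t\, P}{1 - \bar{\alpha}_t} \right)

% The right-hand side vanishes as the noise level increases:
\lim_{\bar{\alpha}_t \to 0} \frac{d}{2} \log\!\left( 1 + \frac{\bar{\alpha}_t\, P}{1 - \bar{\alpha}_t} \right) = 0
```

The bound on $I(x_w; x_t)$ is loose, since it limits everything that survives about the image and not only the payload. A practical attack works because the low-amplitude watermark falls below the noise floor well before the image content does.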


Key Contributions

  • Theoretical proof that diffusion-based image transformation destroys mutual information between a watermarked image and its embedded payload, guaranteeing decoding failure under sufficient noise levels.
  • Unguided diffusion regeneration attack that erases robust image watermarks while preserving perceptual content fidelity.
  • Novel guided diffusion attack that incorporates the watermark decoder as an adversarial guide to explicitly suppress watermark signals during diffusion sampling.
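The unguided regeneration attack listed above can be illustrated on a toy problem. The sketch below is not the paper's implementation and makes several loud assumptions: the watermark is a hypothetical spread-spectrum scheme (not StegaStamp, TrustMark, or VINE), the "image" is a synthetic smooth pattern, and a box blur stands in for a learned diffusion denoiser. All function names and the parameters `amp`, `alpha_bar`, and `radius` are illustrative.

```python
import numpy as np

def make_host(n=128):
    """Smooth synthetic 'image' in [0.2, 0.8]; stands in for natural content."""
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return 0.5 + 0.3 * np.sin(2 * np.pi * i / n) * np.cos(2 * np.pi * j / n)

def embed(host, bits, carriers, amp=0.01):
    """Hypothetical spread-spectrum embedder: one +/-1 carrier per payload bit."""
    signs = 2.0 * bits - 1.0  # map {0, 1} to {-1, +1}
    return host + amp * np.tensordot(signs, carriers, axes=1)

def decode(img, carriers):
    """Hypothetical decoder: sign of the correlation with each carrier."""
    centered = img - img.mean()
    scores = np.tensordot(carriers, centered, axes=2)
    return (scores > 0).astype(int)

def box_blur(img, radius=3):
    """Circular box blur; a crude stand-in for a learned denoiser (assumption)."""
    out = np.zeros_like(img)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            out += np.roll(img, (dy, dx), axis=(0, 1))
    return out / (2 * radius + 1) ** 2

def regenerate(img, alpha_bar, rng, denoiser=box_blur):
    """Noise the image with the DDPM forward step, then estimate the clean image."""
    eps = rng.standard_normal(img.shape)
    x_t = np.sqrt(alpha_bar) * img + np.sqrt(1.0 - alpha_bar) * eps
    return denoiser(x_t / np.sqrt(alpha_bar))

def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB for images with the given peak value."""
    return 10.0 * np.log10(peak ** 2 / np.mean((a - b) ** 2))

def run_demo(seed=0, n_bits=64, alpha_bar=0.9):
    """Embed a payload, attack it, and report bit accuracy and fidelity."""
    rng = np.random.default_rng(seed)
    host = make_host()
    carriers = rng.choice([-1.0, 1.0], size=(n_bits,) + host.shape)
    bits = rng.integers(0, 2, size=n_bits)
    marked = embed(host, bits, carriers)
    attacked = regenerate(marked, alpha_bar, rng)
    return {
        "acc_before": float(np.mean(decode(marked, carriers) == bits)),
        "acc_after": float(np.mean(decode(attacked, carriers) == bits)),
        "psnr_after": float(psnr(attacked, host)),
    }

if __name__ == "__main__":
    print(run_demo())
```

In this toy setting, bit accuracy after the attack would be expected to fall toward chance (0.5) while the smooth content is largely retained, because the denoiser's prior favors the content and not the high-frequency carriers. A guided variant, as described in the third contribution, would additionally use the decoder's output to steer sampling; that step is not sketched here.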

🛡️ Threat Analysis

Output Integrity Attack

The paper attacks and removes content watermarks (StegaStamp, TrustMark, VINE) that are embedded in images for copyright and provenance protection. Removal attacks on content watermarks fall explicitly under ML09. The paper's primary contributions are two novel attack methods, unguided diffusion regeneration and a guided diffusion attack, which defeat these output integrity mechanisms.


Details

Domains
vision, generative
Model Types
diffusion
Threat Tags
black_box, grey_box, inference_time, digital
Datasets
StegaStamp, TrustMark, VINE, Hidden
Applications
image watermarking, copyright protection, content authentication