
Understanding Semantic Perturbations on In-Processing Generative Image Watermarks

Anirudh Nakra, Min Wu


Published on arXiv (2603.27513)

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Watermark detectability drops to near zero under semantic edits while image quality remains high, even for methods robust to conventional perturbations.


The widespread deployment of high-fidelity generative models has intensified the need for reliable mechanisms for provenance and content authentication. In-processing watermarking, which embeds a signature into the generative model's synthesis procedure, has been advocated as a solution and is often reported to be robust to standard post-processing (such as geometric transforms and filtering). Yet robustness to semantic manipulations, which alter high-level scene content while maintaining reasonable visual quality, is not well studied or understood. We introduce a simple, multi-stage framework for systematically stress-testing in-processing generative watermarks under semantic drift. The framework uses off-the-shelf models for object detection, mask generation, and semantically guided inpainting or regeneration to produce controlled, meaning-altering edits with minimal perceptual degradation. Based on extensive experiments on representative schemes, we find that robustness varies significantly with the degree of semantic entanglement: methods whose watermarks remain detectable under a broad suite of conventional perturbations can fail under semantic edits, with watermark detectability in many cases dropping to near zero while image quality remains high. Overall, our results reveal a critical gap in current watermarking evaluations and suggest that watermark designs and benchmarks must explicitly account for robustness against semantic manipulation.
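The abstract describes a three-stage pipeline: object detection, mask generation, and guided inpainting. The sketch below shows that composition in miniature; it is not the paper's implementation, and each stage is a hypothetical numpy stub standing in for an off-the-shelf model (a detector, a segmenter, and an inpainting model).

```python
import numpy as np

rng = np.random.default_rng(0)

def detect_objects(image):
    """Stub for an off-the-shelf detector; returns (y0, x0, y1, x1) boxes."""
    h, w = image.shape[:2]
    return [(h // 4, w // 4, 3 * h // 4, 3 * w // 4)]  # one central "object"

def boxes_to_mask(image, boxes):
    """Stub for a segmenter: rasterize detected boxes into a binary edit mask."""
    mask = np.zeros(image.shape[:2], dtype=bool)
    for y0, x0, y1, x1 in boxes:
        mask[y0:y1, x0:x1] = True
    return mask

def inpaint(image, mask):
    """Stub for semantically guided inpainting: regenerate only masked pixels."""
    out = image.copy()
    out[mask] = rng.uniform(0.0, 255.0, (int(mask.sum()), image.shape[2]))
    return out

def semantic_edit(image):
    """Compose the three stages into one meaning-altering edit."""
    boxes = detect_objects(image)
    mask = boxes_to_mask(image, boxes)
    return inpaint(image, mask), mask

image = rng.uniform(0.0, 255.0, (64, 64, 3))
edited, mask = semantic_edit(image)
```

Because only the masked region is regenerated, pixels outside the edit are untouched, which is why such an attack can preserve overall visual quality.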


Key Contributions

  • Multi-stage framework for semantic watermark attacks using object detection, mask generation, and guided inpainting
  • Empirical demonstration that in-processing watermarks (Stable-Signature, Tree-Ring, Gaussian Shading) fail under semantic edits while surviving pixel-level perturbations
  • Reveals critical gap in watermark evaluation: semantic entanglement determines robustness, not just pixel-level resilience

🛡️ Threat Analysis

Output Integrity Attack

Paper attacks content watermarking schemes embedded in generative models. Semantic manipulation attacks (object replacement, inpainting) remove watermarks from AI-generated images while preserving visual quality — this is an attack on output integrity/content provenance, specifically watermark removal via semantic edits rather than pixel-level perturbations.
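The asymmetry above (surviving pixel-level noise, failing under regeneration) can be illustrated with a toy additive watermark; this is an assumption-laden simplification, not any of the schemes the paper attacks. Regenerating a large masked region discards the watermark signal inside it, collapsing the detection correlation, while mild global noise barely moves it.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 64, 64
key = rng.standard_normal((H, W))       # secret watermark pattern (toy)
host = np.full((H, W), 128.0)           # flat stand-in for generated content
watermarked = host + 5.0 * key          # illustrative additive embedding

def detect(img, key):
    """Normalized correlation between the image residual and the key."""
    resid = img - img.mean()
    k = key - key.mean()
    return float((resid * k).sum() /
                 (np.linalg.norm(resid) * np.linalg.norm(k) + 1e-12))

# Conventional perturbation: mild global noise -> watermark still detectable.
noisy = watermarked + rng.standard_normal((H, W))
score_noisy = detect(noisy, key)

# Semantic edit: regenerate a large "object" region with fresh,
# unwatermarked content -> correlation collapses.
mask = np.zeros((H, W), dtype=bool)
mask[8:56, 8:56] = True
edited = watermarked.copy()
edited[mask] = rng.uniform(0.0, 255.0, int(mask.sum()))
score_edited = detect(edited, key)
```

The qualitative outcome mirrors the paper's key finding: the noisy copy keeps a high detection score, while the semantically edited copy falls near zero.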


Details

Domains
vision, generative
Model Types
diffusion, generative
Threat Tags
black_box, inference_time
Applications
generative image watermarking, content authentication, AI-generated image provenance