Understanding Semantic Perturbations on In-Processing Generative Image Watermarks
Published on arXiv
2603.27513
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Watermark detectability drops to near zero under semantic edits while image quality remains high, even for methods that are robust to conventional perturbations
The widespread deployment of high-fidelity generative models has intensified the need for reliable mechanisms for provenance and content authentication. In-processing watermarking, which embeds a signature directly into the generative model's synthesis procedure, has been advocated as a solution and is often reported to be robust to standard post-processing (such as geometric transforms and filtering). Yet robustness to semantic manipulations, which alter high-level scene content while maintaining reasonable visual quality, remains poorly studied and poorly understood. We introduce a simple, multi-stage framework for systematically stress-testing in-processing generative watermarks under semantic drift. The framework uses off-the-shelf models for object detection, mask generation, and semantically guided inpainting or regeneration to produce controlled, meaning-altering edits with minimal perceptual degradation. Through extensive experiments on representative schemes, we find that robustness varies significantly with the degree of semantic entanglement: methods whose watermarks remain detectable under a broad suite of conventional perturbations can fail under semantic edits, with watermark detectability in many cases dropping to near zero while image quality remains high. Overall, our results reveal a critical gap in current watermarking evaluations and suggest that watermark designs and benchmarks must explicitly account for robustness against semantic manipulation.
Key Contributions
- Multi-stage framework for semantic watermark attacks using object detection, mask generation, and guided inpainting
- Empirical demonstration that in-processing watermarks (Stable-Signature, Tree-Ring, Gaussian Shading) fail under semantic edits while surviving pixel-level perturbations
- Reveals critical gap in watermark evaluation: semantic entanglement determines robustness, not just pixel-level resilience
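The multi-stage attack pipeline above can be sketched as a detect → mask → regenerate chain. The following is a minimal, self-contained illustration: each stage is a stub standing in for the off-the-shelf models the paper describes (an object detector, a mask generator, and a semantically guided inpainting model); the function names, the toy image type, and the stub behavior are all assumptions for illustration, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Toy grayscale image as nested lists of pixel intensities (0-255).
Image = List[List[int]]

@dataclass
class Detection:
    label: str
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1)

def detect_objects(img: Image) -> List[Detection]:
    """Stage 1 (stub): locate semantically meaningful objects.
    A real pipeline would call an off-the-shelf object detector."""
    h, w = len(img), len(img[0])
    return [Detection("object", (w // 4, h // 4, 3 * w // 4, 3 * h // 4))]

def make_mask(img: Image, det: Detection) -> Image:
    """Stage 2 (stub): binary mask covering the detected region."""
    x0, y0, x1, y1 = det.box
    return [[1 if (x0 <= x < x1 and y0 <= y < y1) else 0
             for x in range(len(img[0]))] for y in range(len(img))]

def inpaint(img: Image, mask: Image, prompt: str) -> Image:
    """Stage 3 (stub): replace masked content. A real system would run
    semantically guided inpainting or regeneration conditioned on `prompt`;
    here we just overwrite the region with a constant value."""
    return [[128 if mask[y][x] else px
             for x, px in enumerate(row)] for y, row in enumerate(img)]

def semantic_edit(img: Image, prompt: str) -> Image:
    """Chain the three stages: detect -> mask -> regenerate."""
    out = img
    for det in detect_objects(out):
        out = inpaint(out, make_mask(out, det), prompt)
    return out

# Usage: the central region is regenerated, the rest is untouched.
img = [[0] * 8 for _ in range(8)]
edited = semantic_edit(img, "replace the object with a red car")
```

The key design point is that each stage is model-agnostic: because the edit is driven by semantics (object identity and a text prompt) rather than pixel statistics, the watermark signal entangled with the replaced content is regenerated away while the surrounding image, and hence perceptual quality, is preserved.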
🛡️ Threat Analysis
The paper attacks content watermarking schemes embedded in generative models. Semantic manipulation attacks (object replacement, inpainting) remove watermarks from AI-generated images while preserving visual quality. This is an attack on output integrity and content provenance: watermark removal via semantic edits rather than pixel-level perturbations.
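To make "detectability drops to near zero" concrete, here is a minimal sketch of how multi-bit watermark detection is commonly evaluated: decode the embedded message from the image and flag the image as watermarked when bit accuracy exceeds a threshold. The threshold value and the decision rule are illustrative assumptions, not the paper's exact protocol.

```python
from typing import List

def bit_accuracy(decoded: List[int], embedded: List[int]) -> float:
    """Fraction of watermark bits recovered correctly."""
    assert len(decoded) == len(embedded)
    return sum(d == e for d, e in zip(decoded, embedded)) / len(embedded)

def is_detected(decoded: List[int], embedded: List[int],
                tau: float = 0.75) -> bool:
    """Common decision rule: declare the watermark present when bit
    accuracy exceeds a threshold tau (tau = 0.75 is illustrative)."""
    return bit_accuracy(decoded, embedded) >= tau

# Usage: an intact watermark decodes perfectly; after a semantic edit
# the decoded bits degrade toward chance (~0.5 for random bits), so the
# detector no longer fires. Here half the bits are flipped deterministically.
embedded = [1, 0] * 24                                   # 48-bit message
intact = list(embedded)
after_edit = [b ^ (i % 2) for i, b in enumerate(embedded)]

print(bit_accuracy(intact, embedded))      # → 1.0
print(bit_accuracy(after_edit, embedded))  # → 0.5
print(is_detected(after_edit, embedded))   # → False
```

This framing explains why pixel-level robustness benchmarks miss the failure mode: conventional perturbations degrade bit accuracy gradually, while regenerating semantically entangled content can push it to chance level in one edit.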