defense 2025

Where is the Watermark? Interpretable Watermark Detection at the Block Level

Maria Bulychev , Neil G. Marchant , Benjamin I. P. Rubinstein

0 citations · 47 references · arXiv


Published on arXiv

2512.14994

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Watermarks remain robust to cropping of up to half the image while providing interpretable block-level detection maps that outperform prior post-hoc methods in transparency.

Block-level DWT Watermarking

Novel technique introduced


Recent advances in generative AI have enabled the creation of highly realistic digital content, raising concerns around authenticity, ownership, and misuse. While watermarking has become an increasingly important mechanism to trace and protect digital media, most existing image watermarking schemes operate as black boxes, producing global detection scores without offering any insight into how or where the watermark is present. This lack of transparency impacts user trust and makes it difficult to interpret the impact of tampering. In this paper, we present a post-hoc image watermarking method that combines localised embedding with region-level interpretability. Our approach embeds watermark signals in the discrete wavelet transform domain using a statistical block-wise strategy. This allows us to generate detection maps that reveal which regions of an image are likely watermarked or altered. We show that our method achieves strong robustness against common image transformations while remaining sensitive to semantic manipulations. At the same time, the watermark remains highly imperceptible. Compared to prior post-hoc methods, our approach offers more interpretable detection while retaining competitive robustness. For example, our watermarks are robust to cropping up to half the image.
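The abstract describes embedding in the discrete wavelet transform (DWT) domain with a statistical block-wise strategy, but it does not specify the subband, the pattern, or the test statistic. The following pure-Python code is a minimal sketch under stated assumptions, not the authors' implementation. It uses a one-level orthonormal Haar DWT and adds a keyed ±1 pattern to each 16×16 block of a single detail subband. It then scores every block with a normalized-correlation z-score, which gives a coarse detection map. The subband choice, `alpha`, `block`, the per-block seeding in `block_pattern`, and all function names are hypothetical.

```python
import math
import random

def haar2d(img):
    """One-level orthonormal 2-D Haar DWT (img needs even height and width).
    Returns (LL, LH, HL, HH); subband naming conventions differ between libraries."""
    h, w = len(img), len(img[0])
    bands = [[[0.0] * (w // 2) for _ in range(h // 2)] for _ in range(4)]
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            r, s = i // 2, j // 2
            bands[0][r][s] = (a + b + c + d) / 2  # LL: local average
            bands[1][r][s] = (a - b + c - d) / 2  # LH: left minus right
            bands[2][r][s] = (a + b - c - d) / 2  # HL: top minus bottom
            bands[3][r][s] = (a - b - c + d) / 2  # HH: diagonal difference
    return tuple(bands)

def ihaar2d(LL, LH, HL, HH):
    """Exact inverse of haar2d."""
    h, w = len(LL), len(LL[0])
    img = [[0.0] * (2 * w) for _ in range(2 * h)]
    for r in range(h):
        for s in range(w):
            p, q, u, v = LL[r][s], LH[r][s], HL[r][s], HH[r][s]
            img[2 * r][2 * s] = (p + q + u + v) / 2
            img[2 * r][2 * s + 1] = (p - q + u - v) / 2
            img[2 * r + 1][2 * s] = (p + q - u - v) / 2
            img[2 * r + 1][2 * s + 1] = (p - q - u + v) / 2
    return img

def block_pattern(key, br, bc, size):
    """Hypothetical keyed +/-1 pattern, seeded per block so blocks are independent."""
    rng = random.Random(f"{key}:{br}:{bc}")
    return [rng.choice((-1.0, 1.0)) for _ in range(size * size)]

def block_coords(size, br, bc):
    """Row-major (row, col) coordinates of block (br, bc) inside a subband."""
    return [(br * size + i, bc * size + j) for i in range(size) for j in range(size)]

def embed(img, key, block=16, alpha=4.0):
    """Add alpha * pattern to every block of the LH subband, then invert the DWT."""
    LL, LH, HL, HH = haar2d(img)
    for br in range(len(LH) // block):
        for bc in range(len(LH[0]) // block):
            pat = block_pattern(key, br, bc, block)
            for (r, s), p in zip(block_coords(block, br, bc), pat):
                LH[r][s] += alpha * p
    return ihaar2d(LL, LH, HL, HH)

def detection_map(img, key, block=16):
    """Per-block score z = sum(x*p) / sqrt(sum(x^2)). When the coefficients x are
    independent of the random signs p, z is approximately N(0, 1)."""
    LH = haar2d(img)[1]
    zmap = []
    for br in range(len(LH) // block):
        row = []
        for bc in range(len(LH[0]) // block):
            pat = block_pattern(key, br, bc, block)
            xs = [LH[r][s] for r, s in block_coords(block, br, bc)]
            num = sum(x * p for x, p in zip(xs, pat))
            row.append(num / (math.sqrt(sum(x * x for x in xs)) or 1.0))
        zmap.append(row)
    return zmap

# Usage on a synthetic 64x64 noise "image" (2x2 blocks of 16x16 LH coefficients)
rng = random.Random(0)
host = [[128 + rng.gauss(0, 10) for _ in range(64)] for _ in range(64)]
marked = embed(host, key="secret")
tampered = [row[:] for row in marked]
for i in range(32):  # replace the top-left quadrant, i.e. LH block (0, 0)
    for j in range(32):
        tampered[i][j] = 128 + rng.gauss(0, 10)
print(detection_map(marked, "secret"))    # all four scores typically well above 3
print(detection_map(tampered, "secret"))  # block (0, 0) typically falls back near 0
```

Each block is tested on its own, so editing one quadrant lowers only that block's score and leaves the other blocks unchanged. A practical scheme would also need to handle 8-bit rounding, clipping, and grid resynchronization after geometric edits. This sketch ignores all three.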


Key Contributions

  • Post-hoc image watermarking method using discrete wavelet transform domain with statistical block-wise embedding strategy
  • Region-level detection maps that reveal which blocks of an image are watermarked or semantically altered, providing interpretable tamper localization
  • Demonstration of robustness to common image transformations (e.g., cropping up to 50% of the image) while remaining sensitive to semantic manipulations and maintaining imperceptibility
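The contributions above pair block-wise embedding with robustness to cropping. One way to see how a block-wise design can tolerate cropping is to assume that the global decision pools independent per-block z-scores using Stouffer's method. This aggregation rule is an assumption for illustration; the summary does not specify one. With n surviving blocks of mean score μ, the pooled score grows like μ√n, so halving the number of blocks shrinks it by only a factor of √2. The sketch also assumes the crop stays aligned with the block grid, or that the detector can re-find it. The scores and function names are hypothetical.

```python
import math

def stouffer(zs):
    """Stouffer's method: combine independent per-block z-scores into one z-score."""
    if not zs:
        raise ValueError("need at least one block")
    return sum(zs) / math.sqrt(len(zs))

def one_sided_p(z):
    """P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def min_blocks(mu, threshold):
    """Blocks needed so that n blocks of mean score mu reach the threshold,
    solved from mu * sqrt(n) >= threshold."""
    return math.ceil((threshold / mu) ** 2)

# Hypothetical scores: 64 blocks, each with z = 2.0 (weak evidence on its own)
full = [2.0] * 64
cropped = full[:32]          # cropping away half the image keeps half the blocks
print(one_sided_p(2.0))      # ≈ 0.023: a single block is weak evidence
print(stouffer(full))        # 16.0
print(stouffer(cropped))     # ≈ 11.31, i.e. 16 / sqrt(2)
print(min_blocks(2.0, 4.0))  # 4
```

Under these assumptions, many individually weak blocks add up to strong global evidence. This gives the detector slack to lose a large fraction of blocks to cropping before the pooled score falls below a fixed threshold.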

🛡️ Threat Analysis

Output Integrity Attack

The method embeds watermarks in image content (in the DWT domain) to trace provenance and detect tampering. This is content watermarking for output integrity and authenticity, not model-weight watermarking. The detection maps, which localize altered regions, serve directly as an output-integrity mechanism.
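The threat analysis treats the block-level detection map as the integrity mechanism. The following hedged sketch shows one way a score map might be rendered as a readable tamper map. It assumes each block's score is approximately standard normal when no watermark is present, and picks the per-block threshold from a target false-positive rate with Python's `statistics.NormalDist`. The function names, the `fpr` default, and the example scores are hypothetical.

```python
from statistics import NormalDist

def threshold_for_fpr(fpr):
    """Threshold tau with P(Z >= tau) = fpr for a standard normal Z, i.e. the chance
    that an unmarked block is wrongly labelled as watermarked."""
    return NormalDist().inv_cdf(1.0 - fpr)

def localization_map(zmap, fpr=1e-3):
    """Render per-block scores as text: 'W' = watermark detected, '.' = missing or altered."""
    tau = threshold_for_fpr(fpr)
    return ["".join("W" if z >= tau else "." for z in row) for row in zmap]

# Hypothetical 2x3 score map in which the bottom-left block has been edited
zmap = [[5.8, 6.1, 5.5],
        [0.4, 6.3, 5.9]]
print(round(threshold_for_fpr(1e-3), 2))  # 3.09
for line in localization_map(zmap):
    print(line)  # prints "WWW" then ".WW"
```

A '.' only means the watermark evidence in that block is weak. It does not, on its own, distinguish a semantic edit from other causes such as a missed detection.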


Details

Domains
vision
Threat Tags
inference_time, digital
Applications
digital media authentication, image content provenance, tamper detection