attack 2025

First-Place Solution to NeurIPS 2024 Invisible Watermark Removal Challenge

Fahad Shamshad 1,2, Tameem Bakr 1, Yahia Shaaban 1, Noor Hussein 1,2, Karthik Nandakumar 1,2, Nils Lukas 1

Published on arXiv: 2508.21072

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Achieves 95.7% watermark removal rate with negligible perceptual degradation, outperforming the runner-up by 26% (beige-box) and 31.7% (black-box) in detection score.

Adaptive VAE Evasion + Diffusion Purification

Novel technique introduced


Content watermarking is an important tool for the authentication and copyright protection of digital media. However, it is unclear whether existing watermarks are robust against adversarial attacks. We present the winning solution to the NeurIPS 2024 Erasing the Invisible challenge, which stress-tests watermark robustness under varying degrees of adversary knowledge. The challenge consisted of two tracks, black-box and beige-box, which differ in whether the adversary knows which watermarking method the provider used. For the beige-box track, we leverage an adaptive VAE-based evasion attack with test-time optimization and color-contrast restoration in CIELAB space to preserve the image's quality. For the black-box track, we first cluster images based on their artifacts in the spatial or frequency domain. Then, to each cluster, we apply image-to-image diffusion models with controlled noise injection and semantic priors from ChatGPT-generated captions, using optimized parameter settings. Empirical evaluations demonstrate that our method achieves near-perfect watermark removal (95.7%) with negligible impact on the residual image's quality. We hope that our attacks inspire the development of more robust image watermarking methods.
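The color-contrast restoration step of the beige-box attack can be illustrated with a minimal sketch. The abstract does not specify the restoration procedure, so the code below swaps in a simple per-channel mean and standard-deviation matching (Reinhard-style statistics transfer) in CIELAB space as a stand-in. The function name `match_lab_statistics` and the use of the watermarked input as the reference are assumptions made for illustration. Pixels are assumed to be already converted to (L*, a*, b*) tuples, for example with `skimage.color.rgb2lab`.

```python
from statistics import fmean, pstdev


def match_lab_statistics(attacked, reference, eps=1e-8):
    """Shift and scale each L*, a*, b* channel of `attacked` so that its
    mean and standard deviation match those of `reference`.

    Both arguments are lists of (L, a, b) tuples (assumed input format).
    This is a stand-in for the paper's restoration step, not its method.
    """
    restored_channels = []
    for ch in range(3):
        src = [pixel[ch] for pixel in attacked]
        ref = [pixel[ch] for pixel in reference]
        mu_src, sd_src = fmean(src), pstdev(src)
        mu_ref, sd_ref = fmean(ref), pstdev(ref)
        # Leave a flat source channel unscaled to avoid dividing by zero.
        scale = sd_ref / sd_src if sd_src > eps else 1.0
        restored_channels.append([(v - mu_src) * scale + mu_ref for v in src])
    # Re-interleave the three channel lists into (L, a, b) tuples.
    return list(zip(*restored_channels))


# Hypothetical toy data: the attacked image lost contrast in L* and drifted in b*.
attacked_lab = [(50.0, 0.0, 0.0), (60.0, 10.0, -10.0), (70.0, 20.0, -20.0)]
watermarked_lab = [(40.0, 5.0, 5.0), (60.0, 5.0, 15.0), (80.0, 5.0, 25.0)]
restored_lab = match_lab_statistics(attacked_lab, watermarked_lab)
```

Working in CIELAB rather than RGB separates lightness (L*) from the two chroma axes, so contrast and color drift introduced by a VAE reconstruction can be corrected independently. After restoration, the result would be converted back to sRGB and clipped to the valid range.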


Key Contributions

  • Adaptive VAE-based evasion attack for beige-box watermark removal with test-time optimization and CIELAB color-contrast restoration
  • Black-box watermark removal via frequency/spatial artifact clustering followed by diffusion-based purification guided by ChatGPT-generated semantic captions
  • First-place NeurIPS 2024 Erasing the Invisible solution, outperforming the runners-up by 26% and 31.7% on the beige-box and black-box tracks, respectively, with 95.7% overall watermark removal
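
The "controlled noise injection" in the black-box contribution can be sketched with the standard DDPM forward process, where a strength value decides how far along the noise schedule an image is pushed before a diffusion model denoises it. This is a minimal illustration, not the authors' pipeline: the linear beta schedule, the function names `linear_alpha_bar` and `inject_noise`, and the per-cluster strength values are all assumptions, since the summary does not state which scheduler or settings were used.

```python
import math
import random


def linear_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bar, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bar.append(prod)
    return alpha_bar


def inject_noise(pixels, strength, alpha_bar, rng):
    """Noise `pixels` (scaled to [-1, 1]) up to the timestep implied by
    `strength` in [0, 1]; returns the noisy pixels and the timestep index."""
    t = min(len(alpha_bar) - 1, max(0, round(strength * len(alpha_bar)) - 1))
    signal = math.sqrt(alpha_bar[t])        # fraction of the image kept
    sigma = math.sqrt(1.0 - alpha_bar[t])   # scale of the Gaussian noise
    noisy = [signal * x + sigma * rng.gauss(0.0, 1.0) for x in pixels]
    return noisy, t


# Hypothetical per-cluster strengths; the real values are not given in the source.
CLUSTER_STRENGTH = {"spatial_artifacts": 0.15, "frequency_artifacts": 0.30}

alpha_bar = linear_alpha_bar()
rng = random.Random(0)
image = [0.2, -0.5, 0.9, 0.0]  # toy stand-in for image or latent values
for cluster, strength in CLUSTER_STRENGTH.items():
    noisy, t = inject_noise(image, strength, alpha_bar, rng)
    print(cluster, "timestep:", t, "signal kept:", round(math.sqrt(alpha_bar[t]), 3))
```

A larger strength removes more of the original signal, and with it more of the watermark, but gives the denoiser more freedom to alter content. Tuning the value separately for each artifact cluster is one way to manage that trade-off.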

🛡️ Threat Analysis

Output Integrity Attack

The paper attacks content watermarks embedded in images, a direct attack on output integrity and content provenance systems. Per the guidelines, watermark removal attacks on image content protections are classified as ML09 (Output Integrity Attack), not ML01 (Input Manipulation Attack), even when the underlying technique uses adversarial-style optimization.


Details

Domains
vision, generative
Model Types
diffusion
Threat Tags
black_box, grey_box, inference_time
Datasets
NeurIPS 2024 Erasing the Invisible Challenge dataset
Applications
image watermarking, content authentication, copyright protection