On the Robustness of Watermarking for Autoregressive Image Generation
Andreas Müller 1, Denis Lukovnikov 1, Shingo Kodama 2, Minh Pham 3, Anubhav Jain 4, Jonathan Petit 5, Niv Cohen 3, Asja Fischer 1
Published on arXiv
2604.11720
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Successfully removes and forges watermarks from AR image generators (BitMark, WMAR, ClusterMark, IndexMark) using only a single watermarked reference image, without access to model parameters or watermarking secrets
Watermark Mimicry
Novel technique introduced
The proliferation of autoregressive (AR) image generators demands reliable detection and attribution of their outputs, both to mitigate misinformation and to filter synthetic images from training data to prevent model collapse. To address this need, watermarking techniques designed specifically for AR models embed a subtle signal at generation time, enabling downstream verification through a corresponding watermark detector. In this work, we study these schemes and demonstrate their vulnerability to both watermark removal and forgery attacks. We assess existing attacks and further introduce three new ones: (i) a vector-quantized regeneration removal attack, (ii) an adversarial optimization-based attack, and (iii) a frequency injection attack. Our evaluation reveals that removal and forgery attacks can be effective with access to a single watermarked reference image and without access to original model parameters or watermarking secrets. Our findings indicate that existing watermarking schemes for AR image generation do not reliably support synthetic content detection for dataset filtering. Moreover, they enable Watermark Mimicry, whereby authentic images can be manipulated to imitate a generator's watermark and trigger false detection, preventing their inclusion in future model training.
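To make the frequency injection idea concrete, here is a minimal numpy sketch. Everything in it is a toy stand-in: the "watermark" is a single hidden carrier frequency and the "detector" checks energy at that carrier, whereas the paper's attacks target the learned detectors of real AR watermarking schemes. The sketch only illustrates the mechanism: an attacker who holds one watermarked reference image can copy its dominant spectral components into an authentic image to forge a detection.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions): a "watermarked" reference whose watermark
# is a low-amplitude pattern at a secret carrier frequency, and an
# authentic image the attacker wants to make trigger the detector.
H = W = 64
clean_ref = rng.uniform(0.0, 1.0, (H, W))
watermark_freq = np.zeros((H, W), dtype=complex)
watermark_freq[3, 5] = 400.0          # hypothetical secret carrier
watermarked_ref = clean_ref + np.real(np.fft.ifft2(watermark_freq))

authentic = rng.uniform(0.0, 1.0, (H, W))

def inject_frequency(target, reference, k=16, alpha=1.0):
    """Copy the k strongest frequency components of the reference's
    spectrum into the target's spectrum (DC term excluded)."""
    F_t = np.fft.fft2(target)
    F_r = np.fft.fft2(reference)
    mag = np.abs(F_r)
    mag[0, 0] = 0.0                   # ignore the DC term
    idx = np.unravel_index(np.argsort(mag, axis=None)[-k:], mag.shape)
    F_t[idx] = (1 - alpha) * F_t[idx] + alpha * F_r[idx]
    return np.real(np.fft.ifft2(F_t))

forged = inject_frequency(authentic, watermarked_ref)

def toy_detector(img):
    """Toy detector: checks energy at the secret carrier frequency."""
    return np.abs(np.fft.fft2(img)[3, 5]) > 100.0

print(toy_detector(authentic), toy_detector(forged))  # → False True
```

The attacker never needs the carrier location: the watermark components dominate the reference's spectrum, so copying the strongest components carries them along automatically.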
Key Contributions
- Three novel watermark attacks: a vector-quantized regeneration removal attack, an adversarial optimization-based attack, and a frequency injection attack
- Demonstrates watermark removal and forgery with single reference image and no model/secret access
- Introduces the Watermark Mimicry attack, in which authentic images are manipulated to trigger false AI-generation detection and thereby avoid being harvested as training data
🛡️ Threat Analysis
The paper attacks content watermarking schemes designed to trace the provenance of AI-generated images and enable dataset filtering. It proposes both watermark removal attacks (defeating content protection) and watermark forgery attacks (Watermark Mimicry: making authentic images appear AI-generated). This concerns output integrity and content provenance, not model IP protection.
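The adversarial optimization-based removal attack can likewise be sketched in a few lines. The stand-ins are again assumptions: the surrogate detector here is a known linear scorer so the gradient is available in closed form, whereas a real attack backpropagates through a surrogate network approximating the target detector. The sketch shows the core loop: projected gradient descent on the surrogate score under a small L-infinity perturbation budget.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-ins (assumptions): linear surrogate detector with known
# weights; watermarking = a shift of the image along those weights.
D = 4096
w = rng.normal(size=D)
w /= np.linalg.norm(w)
x = rng.uniform(0.0, 1.0, D)        # clean image (flattened)
x_wm = x + 0.5 * w                  # watermarked image

def score(img):
    return float(w @ img)

def detect(img):
    """Toy detection rule: score above a calibrated threshold."""
    return score(img) > score(x) + 0.25

def pgd_remove(img, budget=0.02, steps=10):
    """Projected gradient descent under an L-infinity budget: each step
    moves against the surrogate score's gradient, then clips the total
    perturbation back into the budget."""
    delta = np.zeros_like(img)
    for _ in range(steps):
        grad = w                    # d(score)/d(img) for a linear model
        delta = np.clip(delta - (budget / steps) * np.sign(grad),
                        -budget, budget)
    return img + delta

attacked = pgd_remove(x_wm)
print(detect(x_wm), detect(attacked))  # → True False
```

The perturbation is tiny per pixel (at most 0.02 in a [0, 1] range), yet because it is aligned against the detector's gradient across all pixels, it moves the score far more efficiently than random noise of the same magnitude would.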