Towards Robust Red-Green Watermarking for Autoregressive Image Generators
Denis Lukovnikov, Andreas Müller, Erwin Quiring, Asja Fischer
Published on arXiv: 2508.06656
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Cluster-level watermarks improve robustness against image perturbations and regeneration attacks while preserving image quality, with VAE-based cluster classification outperforming all baselines while retaining fast verification runtime.
In-generation watermarking for detecting and attributing generated content has recently been explored for latent diffusion models (LDMs), demonstrating high robustness. However, the use of in-generation watermarks in autoregressive (AR) image models has not been explored yet. AR models generate images by autoregressively predicting a sequence of visual tokens that are then decoded into pixels using a vector-quantized decoder. Inspired by red-green watermarks for large language models, we examine token-level watermarking schemes that bias the next-token prediction based on prior tokens. We find that a direct transfer of these schemes works in principle, but the detectability of the watermarks decreases considerably under common image perturbations. As a remedy, we propose two novel watermarking methods that rely on visual token clustering to assign similar tokens to the same set. Firstly, we investigate a training-free approach that relies on a cluster lookup table, and secondly, we finetune VAE encoders to predict token clusters directly from perturbed images. Overall, our experiments show that cluster-level watermarks improve robustness against perturbations and regeneration attacks while preserving image quality. Cluster classification further boosts watermark detectability, outperforming a set of baselines. Moreover, our methods offer fast verification runtime, comparable to lightweight post-hoc watermarking methods.
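The red-green scheme the abstract refers to can be illustrated with a minimal sketch: a PRNG seeded on the previous token partitions the vocabulary into a "green" and "red" set, a bias is added to green-token logits before sampling, and detection counts green hits via a z-score. All names and parameters below (`VOCAB_SIZE`, `GAMMA`, `DELTA`, the hash-based seeding) are illustrative assumptions, not the paper's implementation.

```python
import hashlib
import numpy as np

# Hypothetical parameters for a red-green token watermark sketch.
VOCAB_SIZE = 1024  # size of the visual-token codebook (assumed)
GAMMA = 0.5        # fraction of the vocabulary in the green set
DELTA = 2.0        # logit bias added to green tokens

def green_set(prev_token: int, key: int = 42) -> np.ndarray:
    """Derive the green set from the previous token via a seeded PRNG."""
    seed = int(hashlib.sha256(f"{key}:{prev_token}".encode()).hexdigest(), 16) % 2**32
    rng = np.random.default_rng(seed)
    perm = rng.permutation(VOCAB_SIZE)
    return perm[: int(GAMMA * VOCAB_SIZE)]

def biased_logits(logits: np.ndarray, prev_token: int) -> np.ndarray:
    """Add DELTA to the logits of green tokens before sampling the next token."""
    out = logits.copy()
    out[green_set(prev_token)] += DELTA
    return out

def z_score(tokens: list[int]) -> float:
    """Detection: count green hits among consecutive token pairs and
    normalize against the GAMMA * n hits expected by chance."""
    hits = sum(t in set(green_set(p)) for p, t in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return (hits - GAMMA * n) / np.sqrt(n * GAMMA * (1 - GAMMA))
```

Verification only needs the key and the token sequence, which is why this style of check is cheap; the paper's observation is that pixel-space perturbations scramble the recovered tokens and thus erode exactly this statistic.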
Key Contributions
- Adaptation of red-green token-level watermarking (originally for LLMs) to autoregressive image generation models via visual token sequences
- Cluster lookup table approach (training-free) assigning similar visual tokens to the same watermark set for improved robustness
- Fine-tuned VAE encoder for direct cluster prediction from perturbed images, further boosting watermark detectability and robustness against perturbations and regeneration attacks
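The cluster lookup table idea in the contributions above can be sketched as follows: group codebook vectors into clusters of visually similar tokens, then define the green/red split over cluster ids rather than raw token ids, so a perturbation that nudges a token to a neighbor in the same cluster does not flip its watermark membership. This is a hedged illustration under assumed parameters; the nearest-centroid assignment stands in for a proper clustering of the VQ codebook, and none of the names below come from the paper's code.

```python
import hashlib
import numpy as np

# Assumed sizes for the sketch, not the paper's configuration.
VOCAB_SIZE = 1024
N_CLUSTERS = 64
GAMMA = 0.5

def build_cluster_table(codebook: np.ndarray, n_clusters: int = N_CLUSTERS) -> np.ndarray:
    """Assign each codebook vector to its nearest of n_clusters sampled
    centroids (a crude stand-in for k-means over the VQ codebook)."""
    rng = np.random.default_rng(0)
    centroids = codebook[rng.choice(len(codebook), n_clusters, replace=False)]
    dists = np.linalg.norm(codebook[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1)  # token id -> cluster id

def green_clusters(prev_cluster: int, key: int = 42) -> set[int]:
    """Seed the green/red split on the previous token's *cluster* id."""
    seed = int(hashlib.sha256(f"{key}:{prev_cluster}".encode()).hexdigest(), 16) % 2**32
    rng = np.random.default_rng(seed)
    perm = rng.permutation(N_CLUSTERS)
    return set(perm[: int(GAMMA * N_CLUSTERS)].tolist())

def is_green(prev_token: int, token: int, table: np.ndarray) -> bool:
    """A token counts as green if its cluster falls in the green set
    derived from the previous token's cluster."""
    return int(table[token]) in green_clusters(int(table[prev_token]))
```

The fine-tuned VAE encoder in the second contribution plays the role of `table` at verification time: instead of a fixed lookup, it predicts the cluster id directly from (possibly perturbed) pixels, which is what boosts detectability under attack.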
🛡️ Threat Analysis
The paper embeds watermarks in model-generated image outputs (visual token sequences decoded to pixels) to enable detection and attribution of AI-generated content. This is content provenance / output integrity watermarking — the watermark is in the generated content, not in model weights. The evaluation includes robustness against watermark-removal perturbations and regeneration attacks, which are output integrity threats.