Towards Robust Red-Green Watermarking for Autoregressive Image Generators
Denis Lukovnikov, Andreas Müller, Erwin Quiring, Asja Fischer
Published on arXiv: 2508.06656
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Cluster-level watermarks improve robustness against image perturbations and regeneration attacks while preserving image quality, with VAE-based cluster classification outperforming all baselines while retaining fast verification runtime.
In-generation watermarking for detecting and attributing generated content has recently been explored for latent diffusion models (LDMs), demonstrating high robustness. However, the use of in-generation watermarks in autoregressive (AR) image models has not been explored yet. AR models generate images by autoregressively predicting a sequence of visual tokens that are then decoded into pixels using a vector-quantized decoder. Inspired by red-green watermarks for large language models, we examine token-level watermarking schemes that bias the next-token prediction based on prior tokens. We find that a direct transfer of these schemes works in principle, but the detectability of the watermarks decreases considerably under common image perturbations. As a remedy, we propose two novel watermarking methods that rely on visual token clustering to assign similar tokens to the same set. Firstly, we investigate a training-free approach that relies on a cluster lookup table, and secondly, we finetune VAE encoders to predict token clusters directly from perturbed images. Overall, our experiments show that cluster-level watermarks improve robustness against perturbations and regeneration attacks while preserving image quality. Cluster classification further boosts watermark detectability, outperforming a set of baselines. Moreover, our methods offer fast verification runtime, comparable to lightweight post-hoc watermarking methods.
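The red-green scheme the abstract refers to can be illustrated with a minimal sketch: a PRNG seeded on the previous token partitions the vocabulary into a "green" and "red" set, a bias is added to green-token logits before sampling, and detection counts green hits via a z-score. All names and parameters below (`VOCAB_SIZE`, `GAMMA`, `DELTA`, the hash-based seeding) are illustrative assumptions, not the paper's implementation.

```python
import hashlib
import numpy as np

# Hypothetical parameters for a red-green token watermark sketch.
VOCAB_SIZE = 1024  # size of the visual-token codebook (assumed)
GAMMA = 0.5        # fraction of the vocabulary in the green set
DELTA = 2.0        # logit bias added to green tokens

def green_set(prev_token: int, key: int = 42) -> np.ndarray:
    """Derive the green set from the previous token via a seeded PRNG."""
    seed = int(hashlib.sha256(f"{key}:{prev_token}".encode()).hexdigest(), 16) % 2**32
    rng = np.random.default_rng(seed)
    perm = rng.permutation(VOCAB_SIZE)
    return perm[: int(GAMMA * VOCAB_SIZE)]

def biased_logits(logits: np.ndarray, prev_token: int) -> np.ndarray:
    """Add DELTA to the logits of green tokens before sampling the next token."""
    out = logits.copy()
    out[green_set(prev_token)] += DELTA
    return out

def z_score(tokens: list[int]) -> float:
    """Detection: count green hits among consecutive token pairs and
    normalize against the GAMMA * n hits expected by chance."""
    hits = sum(t in set(green_set(p)) for p, t in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return (hits - GAMMA * n) / np.sqrt(n * GAMMA * (1 - GAMMA))
```

Verification only needs the key and the token sequence, which is why this style of check is cheap; the paper's observation is that pixel-space perturbations scramble the recovered tokens and thus erode exactly this statistic.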
Key Contributions
- Adaptation of red-green token-level watermarking (originally for LLMs) to autoregressive image generation models via visual token sequences
- Cluster lookup table approach (training-free) assigning similar visual tokens to the same watermark set for improved robustness
- Fine-tuned VAE encoder for direct cluster prediction from perturbed images, further boosting watermark detectability and robustness against perturbations and regeneration attacks
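The cluster lookup table idea in the contributions above can be sketched as follows: group codebook vectors into clusters of visually similar tokens, then define the green/red split over cluster ids rather than raw token ids, so a perturbation that nudges a token to a neighbor in the same cluster does not flip its watermark membership. This is a hedged illustration under assumed parameters; the nearest-centroid assignment stands in for a proper clustering of the VQ codebook, and none of the names below come from the paper's code.

```python
import hashlib
import numpy as np

# Assumed sizes for the sketch, not the paper's configuration.
VOCAB_SIZE = 1024
N_CLUSTERS = 64
GAMMA = 0.5

def build_cluster_table(codebook: np.ndarray, n_clusters: int = N_CLUSTERS) -> np.ndarray:
    """Assign each codebook vector to its nearest of n_clusters sampled
    centroids (a crude stand-in for k-means over the VQ codebook)."""
    rng = np.random.default_rng(0)
    centroids = codebook[rng.choice(len(codebook), n_clusters, replace=False)]
    dists = np.linalg.norm(codebook[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1)  # token id -> cluster id

def green_clusters(prev_cluster: int, key: int = 42) -> set[int]:
    """Seed the green/red split on the previous token's *cluster* id."""
    seed = int(hashlib.sha256(f"{key}:{prev_cluster}".encode()).hexdigest(), 16) % 2**32
    rng = np.random.default_rng(seed)
    perm = rng.permutation(N_CLUSTERS)
    return set(perm[: int(GAMMA * N_CLUSTERS)].tolist())

def is_green(prev_token: int, token: int, table: np.ndarray) -> bool:
    """A token counts as green if its cluster falls in the green set
    derived from the previous token's cluster."""
    return int(table[token]) in green_clusters(int(table[prev_token]))
```

The fine-tuned VAE encoder in the second contribution plays the role of `table` at verification time: instead of a fixed lookup, it predicts the cluster id directly from (possibly perturbed) pixels, which is what boosts detectability under attack.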
🛡️ Threat Analysis
The paper embeds watermarks in model-generated image outputs (visual token sequences decoded to pixels) to enable detection and attribution of AI-generated content. This is content provenance / output integrity watermarking — the watermark is in the generated content, not in model weights. The evaluation includes robustness against watermark-removal perturbations and regeneration attacks, which are output integrity threats.