Defense · 2025

Towards Robust Red-Green Watermarking for Autoregressive Image Generators

Denis Lukovnikov, Andreas Müller, Erwin Quiring, Asja Fischer

Published on arXiv: 2508.06656

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Cluster-level watermarks improve robustness against image perturbations and regeneration attacks while preserving image quality; VAE-based cluster classification outperforms all baselines while keeping verification runtime fast.


In-generation watermarking for detecting and attributing generated content has recently been explored for latent diffusion models (LDMs), demonstrating high robustness. However, the use of in-generation watermarks in autoregressive (AR) image models has not been explored yet. AR models generate images by autoregressively predicting a sequence of visual tokens that are then decoded into pixels using a vector-quantized decoder. Inspired by red-green watermarks for large language models, we examine token-level watermarking schemes that bias the next-token prediction based on prior tokens. We find that a direct transfer of these schemes works in principle, but the detectability of the watermarks decreases considerably under common image perturbations. As a remedy, we propose two novel watermarking methods that rely on visual token clustering to assign similar tokens to the same set. Firstly, we investigate a training-free approach that relies on a cluster lookup table, and secondly, we finetune VAE encoders to predict token clusters directly from perturbed images. Overall, our experiments show that cluster-level watermarks improve robustness against perturbations and regeneration attacks while preserving image quality. Cluster classification further boosts watermark detectability, outperforming a set of baselines. Moreover, our methods offer fast verification runtime, comparable to lightweight post-hoc watermarking methods.
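The red-green scheme the abstract transfers from LLMs can be sketched as follows: the previous token seeds a pseudo-random partition of the VQ codebook into a "green" and a "red" set, and the green logits receive a positive bias before sampling. All constants here (codebook size, green fraction `GAMMA`, bias `DELTA`) are illustrative assumptions, not values from the paper.

```python
import numpy as np

VOCAB_SIZE = 16384  # hypothetical VQ codebook size (assumption)
GAMMA = 0.5         # fraction of tokens marked "green" (assumption)
DELTA = 2.0         # logit bias added to green tokens (assumption)

def green_mask(prev_token: int) -> np.ndarray:
    """Derive a pseudo-random green/red split of the vocabulary from
    the previous token, as in red-green LLM watermarking."""
    rng = np.random.default_rng(prev_token)  # keyed by prior token
    perm = rng.permutation(VOCAB_SIZE)
    mask = np.zeros(VOCAB_SIZE, dtype=bool)
    mask[perm[: int(GAMMA * VOCAB_SIZE)]] = True
    return mask

def watermarked_sample(logits: np.ndarray, prev_token: int) -> int:
    """Bias the next-token logits toward the green set, then sample."""
    biased = logits + DELTA * green_mask(prev_token)
    probs = np.exp(biased - biased.max())
    probs /= probs.sum()
    return int(np.random.default_rng().choice(VOCAB_SIZE, p=probs))
```

Because the partition is re-derived from the same context at verification time, no model access is needed to check which tokens were green.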


Key Contributions

  • Adaptation of red-green token-level watermarking (originally for LLMs) to autoregressive image generation models via visual token sequences
  • Cluster lookup table approach (training-free) assigning similar visual tokens to the same watermark set for improved robustness
  • Fine-tuned VAE encoder for direct cluster prediction from perturbed images, further boosting watermark detectability and robustness against perturbations and regeneration attacks
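The training-free cluster lookup table from the contributions above can be illustrated as a minimal sketch: color *clusters* rather than individual tokens, then lift the coloring to token level through the table, so visually similar tokens always share a color. The random table and cluster count here are placeholders; the paper obtains clusters from the codebook itself (e.g. by grouping similar token embeddings).

```python
import numpy as np

N_TOKENS = 16384   # hypothetical VQ codebook size (assumption)
N_CLUSTERS = 512   # hypothetical number of visual-token clusters (assumption)

# Placeholder lookup table: token id -> cluster id. In practice this would
# come from clustering the codebook embeddings, not from random assignment.
token_to_cluster = np.random.default_rng(0).integers(0, N_CLUSTERS, size=N_TOKENS)

def cluster_green_mask(prev_token: int, gamma: float = 0.5) -> np.ndarray:
    """Partition clusters (not tokens) into green/red, keyed by the prior
    token, then map the partition back to token level via the table.
    Tokens in the same cluster always share a color, so swapping a token
    for a visually similar one tends to preserve the watermark signal."""
    rng = np.random.default_rng(prev_token)
    green_clusters = rng.random(N_CLUSTERS) < gamma
    return green_clusters[token_to_cluster]
```

This is the robustness intuition behind the cluster-level scheme: perturbations that nudge a pixel region onto a neighboring codebook entry usually stay inside the same cluster, leaving the green/red statistics intact.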

🛡️ Threat Analysis

Output Integrity Attack

The paper embeds watermarks in model-generated image outputs (visual token sequences decoded to pixels) to enable detection and attribution of AI-generated content. This is content provenance / output integrity watermarking — the watermark is in the generated content, not in model weights. The evaluation includes robustness against watermark-removal perturbations and regeneration attacks, which are output integrity threats.
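Verification in this family of schemes typically reduces to a one-proportion z-test: re-derive the green set for each position, count green hits, and compare against the fraction expected under the no-watermark null. The sketch below assumes a caller-supplied `green_fn(prev, tok)` predicate and a threshold chosen for a target false-positive rate; both are illustrative, not taken from the paper.

```python
import math

def detect(tokens, green_fn, gamma=0.5, threshold=4.0):
    """One-proportion z-test for a red-green watermark.

    tokens:    decoded visual-token sequence under test
    green_fn:  predicate (prev_token, token) -> bool, True if token is
               in the green set keyed by prev_token (assumed interface)
    gamma:     expected green fraction under the null hypothesis
    """
    n = len(tokens) - 1  # number of (prev, next) pairs scored
    hits = sum(green_fn(prev, tok) for prev, tok in zip(tokens, tokens[1:]))
    z = (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
    return z, z > threshold
```

Because detection only needs the token sequence (recovered by re-encoding the image) and the keyed partition, verification stays cheap, which matches the paper's claim of runtime comparable to lightweight post-hoc watermarking.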


Details

Domains
vision, generative
Model Types
transformer
Threat Tags
inference_time
Applications
ai-generated image detection, content attribution, image watermarking