defense 2026

Learning to Watermark in the Latent Space of Generative Models

Sylvestre-Alvise Rebuffi 1, Tuan Tran 1, Valeriu Lacatusu 1, Pierre Fernandez 1, Tomáš Souček 1, Nikola Jovanović 1,2, Tom Sander 1, Hady Elsahar 1, Alexandre Mourachko 1

0 citations · 62 references · arXiv


Published on arXiv · 2601.16140

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Latent-space watermarking achieves robustness competitive with pixel-space post-hoc watermarking methods at up to a 20x speedup, while enabling tamper-resistant in-model deployment.

DistSeal

Novel technique introduced


Existing approaches for watermarking AI-generated images often rely on post-hoc methods applied in pixel space, introducing computational overhead and potential visual artifacts. In this work, we explore latent space watermarking and introduce DistSeal, a unified approach for latent watermarking that works across both diffusion and autoregressive models. Our approach works by training post-hoc watermarking models in the latent space of generative models. We demonstrate that these latent watermarkers can be effectively distilled either into the generative model itself or into the latent decoder, enabling in-model watermarking. The resulting latent watermarks achieve competitive robustness while offering similar imperceptibility and up to 20x speedup compared to pixel-space baselines. Our experiments further reveal that distilling latent watermarkers outperforms distilling pixel-space ones, providing a solution that is both more efficient and more robust.
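The core idea above — perturb the latent rather than the pixels, and read the mark back out of the latent residual — can be illustrated with a toy sketch. This is a hand-rolled additive scheme over numpy arrays, not DistSeal's learned watermarker: the function names, the fixed pseudo-random bit directions, and the `strength` parameter are all assumptions for illustration.

```python
import numpy as np

def embed_watermark_latent(z, message_bits, strength=0.05, key=0):
    # Map each message bit to a fixed pseudo-random direction in latent space.
    # (Hypothetical additive scheme; DistSeal trains a network for this step.)
    rng = np.random.default_rng(key)
    directions = rng.standard_normal((len(message_bits), z.size))
    signs = 2 * np.asarray(message_bits) - 1          # {0,1} -> {-1,+1}
    delta = (signs @ directions).reshape(z.shape)
    return z + strength * delta / np.linalg.norm(delta)

def detect_watermark_latent(z_wm, z, n_bits, key=0):
    # Recover each bit by correlating the residual with its secret direction.
    rng = np.random.default_rng(key)
    directions = rng.standard_normal((n_bits, z.size))
    residual = (z_wm - z).ravel()
    return (directions @ residual > 0).astype(int)

z = np.random.default_rng(1).standard_normal((4, 8, 8))  # toy latent tensor
bits = [1, 0, 1, 1, 0, 0, 1, 0]
z_wm = embed_watermark_latent(z, bits)
recovered = detect_watermark_latent(z_wm, z, len(bits))
```

Because the latent is orders of magnitude smaller than the decoded image, operating here is what makes the reported speedup over pixel-space post-hoc methods plausible.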


Key Contributions

  • DistSeal: a unified latent-space watermarking framework compatible with both diffusion and autoregressive image generation models, achieving up to 20x speedup over pixel-space baselines
  • Demonstrates that latent watermarkers can be distilled into generative model weights or the latent decoder for in-model watermarking that cannot be bypassed in open-source deployments
  • Empirically shows that distilling latent watermarkers is more effective than distilling pixel-space watermarkers, achieving state-of-the-art robustness for in-model watermarking
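The distillation step in the second bullet can be sketched as a regression problem: make a student decoder map *clean* latents directly to the watermarked outputs of the frozen teacher pipeline, so the watermark ships inside the model weights. The linear "decoder", dimensions, and closed-form least-squares fit below are toy stand-ins for the paper's gradient-based training of a real latent decoder.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 16, 32                                    # toy latent and pixel dims
W = rng.standard_normal((p, d)) * 0.1            # frozen linear "decoder"
u = rng.standard_normal(d)
u /= np.linalg.norm(u)                           # secret watermark direction

def teacher(z):
    # Teacher pipeline: watermark the latent, then decode with the frozen decoder.
    return W @ (z + 0.1 * u)

# Distill: fit an affine student that maps clean latents straight to the
# watermarked outputs (least-squares stand-in for SGD fine-tuning).
Z = rng.standard_normal((256, d))
Y = np.stack([teacher(zi) for zi in Z])
Z_aug = np.hstack([Z, np.ones((256, 1))])        # absorb bias term
theta, *_ = np.linalg.lstsq(Z_aug, Y, rcond=None)
W_student, b_student = theta[:d].T, theta[d]

z = rng.standard_normal(d)
out = W_student @ z + b_student                  # watermarked with no extra pass
```

After distillation, every decode is watermarked by construction, which is why in-model deployment is harder to strip than a separable post-hoc watermarking stage.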

🛡️ Threat Analysis

Output Integrity Attack

Watermarks are embedded in model OUTPUTS (generated images) to authenticate synthetic origin and trace content provenance — this is content watermarking for output integrity, not model IP protection. The paper explicitly targets AI-generated content authentication and deepfake attribution.
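Provenance verification in this threat model reduces to comparing the bits recovered from a suspect image against the owner's message and bounding the false-positive rate. A minimal stdlib sketch (the function name and decision logic are illustrative, not the paper's detector):

```python
import math

def watermark_pvalue(recovered_bits, message_bits):
    # p-value under H0 "not watermarked": each bit matches by chance with
    # probability 1/2, so the match count k follows Binomial(n, 0.5).
    n = len(message_bits)
    k = sum(int(r == m) for r, m in zip(recovered_bits, message_bits))
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return k, tail

msg = [1, 0, 1, 1, 0, 0, 1, 0] * 4               # 32-bit message
k, p = watermark_pvalue(msg, msg)                # perfect recovery: p = 2^-32
```

A small p-value lets the verifier attribute an image to the generator (e.g. for deepfake attribution) with an explicit false-accusation bound.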


Details

Domains
vision, generative
Model Types
diffusion, transformer
Threat Tags
inference_time, digital
Applications
ai-generated image watermarking, content provenance tracking, synthetic media attribution, deepfake detection