defense · 2025

Watermarking Discrete Diffusion Language Models

Avi Bagchi 1, Akhil Bhimaraju 2, Moulik Choraria 2, Daniel Alabi 2, Lav R. Varshney 2,3

0 citations · 54 references · arXiv


Published on arXiv · 2511.02083

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

A provably distortion-free watermark for discrete diffusion language models achieves reliable detection on LLaDA, with false-positive probability decaying exponentially in sequence length

Gumbel-max DDLM Watermark

Novel technique introduced


Watermarking has emerged as a promising technique to track AI-generated content and differentiate it from authentic human creations. While prior work extensively studies watermarking for autoregressive large language models (LLMs) and image diffusion models, it remains comparatively underexplored for discrete diffusion language models (DDLMs), which are becoming popular due to their high inference throughput. In this paper, we introduce one of the first watermarking methods for DDLMs. Our approach applies a distribution-preserving Gumbel-max sampling trick at every diffusion step and seeds the randomness by sequence position to enable reliable detection. We empirically demonstrate reliable detectability on LLaDA, a state-of-the-art DDLM. We also analytically prove that the watermark is distortion-free, with a false detection probability that decays exponentially in the sequence length. A key practical advantage is that our method realizes desired watermarking properties with no expensive hyperparameter tuning, making it straightforward to deploy and scale across models and benchmarks.
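The core mechanism described in the abstract, Gumbel-max sampling with position-seeded randomness, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `watermarked_sample` and the key value are hypothetical, and a real DDLM would apply this at each masked position of each diffusion step.

```python
import numpy as np

def watermarked_sample(logits, position, key=0xC0FFEE):
    """Gumbel-max sampling for one sequence position at a diffusion step.

    The Gumbel noise is derived from a PRNG seeded by a secret key and the
    sequence position, so a given position receives the same noise vector
    whenever it is (re)sampled -- this is what a detector can later test for.
    Because argmax(logits + Gumbel noise) is an exact sample from
    softmax(logits), the output distribution is unchanged (distortion-free).
    """
    rng = np.random.default_rng((key, position))      # position-seeded randomness
    u = rng.uniform(1e-12, 1.0, size=logits.shape)    # uniforms in (0, 1)
    gumbel = -np.log(-np.log(u))                      # standard Gumbel noise
    return int(np.argmax(logits + gumbel))            # distribution-preserving pick
```

Because the noise depends only on the key and the position (not the diffusion step), repeated denoising passes over the same position stay consistent, and no hyperparameters need tuning.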


Key Contributions

  • One of the first watermarking methods tailored to discrete diffusion language models (DDLMs), applying distribution-preserving Gumbel-max sampling at every diffusion step
  • Analytical proof that the scheme is distortion-free and achieves exponentially decaying false positive probability in sequence length
  • Demonstrated reliable detectability on LLaDA with no expensive hyperparameter tuning, enabling straightforward deployment
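Detection in this style of Gumbel-max watermark can be sketched by re-deriving each position's uniforms from the shared key and scoring the observed tokens. This is a hedged sketch under assumed names (`detect`, the key constant) and a flat-logit simplification, not the paper's exact statistic; for unwatermarked text each score term is Exp(1)-distributed, so a Chernoff bound on the sum gives a false-positive probability that decays exponentially in sequence length.

```python
import numpy as np

def detect(tokens, vocab_size, key=0xC0FFEE):
    """Score a token sequence against the position-seeded watermark.

    For each position, re-derive the uniform vector used at generation time
    and look up the value assigned to the observed token. Watermarked tokens
    tend to have won the Gumbel argmax, so their uniforms are unusually
    large and the summed score -log(1 - u) is high. For unwatermarked text
    each u[tok] is Uniform(0, 1), making each term Exp(1); the sum then
    concentrates near len(tokens), so a threshold c * len(tokens) with c > 1
    yields an exponentially small false-positive probability.
    """
    score = 0.0
    for i, tok in enumerate(tokens):
        rng = np.random.default_rng((key, i))          # same seed as generation
        u = rng.uniform(1e-12, 1.0, size=vocab_size)   # same uniforms as generation
        score += -np.log(1.0 - u[tok])
    return score
```

The detector needs only the secret key and the tokenized text, not the model or its logits, which is what makes deployment across models straightforward.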

🛡️ Threat Analysis

Output Integrity Attack

Proposes watermarking of model-generated text outputs to track content provenance and distinguish AI-generated from human-written content — a direct output integrity and content authenticity contribution.


Details

Domains
nlp · generative
Model Types
diffusion · llm
Threat Tags
inference_time
Datasets
LLaDA
Applications
ai-generated text detection · content provenance · text watermarking