defense 2026

Refined Detection for Gumbel Watermarking

Tor Lattimore

0 citations

Published on arXiv

2603.30017

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Achieves near-optimal detection efficiency, with bounds expressed in terms of the distribution of next-token entropies, improving upon the Ω(log(1/δ)/H̄²) detection time of the original Gumbel scheme

Refined Gumbel Watermark Detection

Novel technique introduced


We propose a simple detection mechanism for the Gumbel watermarking scheme proposed by Aaronson (2022). The new mechanism is proven to be near-optimal in a problem-dependent sense among all model-agnostic watermarking schemes under the assumption that the next-token distribution is sampled i.i.d.
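
For orientation, the sketch below illustrates the baseline Gumbel watermark that the abstract builds on, not the refined detector proposed in the paper. It assumes the commonly described form of the scheme: a keyed hash of the recent context yields one uniform per vocabulary entry, the sampler picks the token maximizing log(u_i)/p_i, and the detector sums -log(1 - u) over the observed tokens. All function names, the context length k, the toy model that draws a fresh random next-token distribution at every step, and the sub-gamma tail threshold are assumptions made for illustration only.

```python
import hashlib
import math
import random


def uniforms(secret: bytes, context: tuple, vocab_size: int) -> list:
    """One pseudo-random uniform per token, derived from the key and context."""
    seed = hashlib.sha256(secret + repr(context).encode()).digest()
    rng = random.Random(seed)
    eps = 1e-12  # clamp away from 0 and 1 so both logarithms stay finite
    return [min(max(rng.random(), eps), 1.0 - eps) for _ in range(vocab_size)]


def sample_token(probs, secret, context):
    """Gumbel-style sampling: argmax_i log(u_i) / p_i over tokens with p_i > 0."""
    u = uniforms(secret, context, len(probs))
    candidates = [i for i, p in enumerate(probs) if p > 0]
    return max(candidates, key=lambda i: math.log(u[i]) / probs[i])


def detection_score(tokens, secret, vocab_size, k=4):
    """Model-agnostic statistic: sum of -log(1 - u) at the observed tokens."""
    score = 0.0
    for t, tok in enumerate(tokens):
        context = tuple(tokens[max(0, t - k):t])
        u = uniforms(secret, context, vocab_size)
        score += -math.log(1.0 - u[tok])
    return score


def threshold(n, delta):
    """Assumed threshold: without a watermark the score is roughly a sum of n
    Exp(1) variables, so P(score >= n + sqrt(2 n x) + x) <= exp(-x)."""
    x = math.log(1.0 / delta)
    return n + math.sqrt(2.0 * n * x) + x


def random_distribution(rng, vocab_size):
    """Toy next-token distribution: normalized Exp(1) weights."""
    w = [rng.expovariate(1.0) for _ in range(vocab_size)]
    total = sum(w)
    return [wi / total for wi in w]


def generate(n, vocab_size, secret=None, k=4, seed=0):
    """Toy 'model': a fresh random distribution per step (hypothetical setup).
    With a secret the token is watermarked, otherwise sampled normally."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(n):
        probs = random_distribution(rng, vocab_size)
        if secret is None:
            tok = rng.choices(range(vocab_size), weights=probs)[0]
        else:
            context = tuple(tokens[max(0, len(tokens) - k):])
            tok = sample_token(probs, secret, context)
        tokens.append(tok)
    return tokens
```

Text would be flagged as watermarked when `detection_score(tokens, secret, vocab_size)` exceeds `threshold(len(tokens), delta)`. The detector needs only the key and the tokens, never the model's probabilities, which is the sense of "model-agnostic" assumed in this sketch.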


Key Contributions

  • Proposes a refined detection mechanism for Gumbel watermarking with near-optimal problem-dependent statistical efficiency
  • Proves matching upper and lower bounds on the number of tokens needed to detect watermarked text in terms of entropy-like quantities
  • Demonstrates the detection mechanism is near-optimal among all model-agnostic watermarking schemes under i.i.d. next-token distribution assumptions

🛡️ Threat Analysis

Output Integrity Attack

The paper addresses detection of watermarked LLM-generated text to verify content provenance. The Gumbel watermarking scheme embeds detectable signals in model outputs (not in model weights), making this an output integrity/content authentication defense.


Details

Domains
nlp
Model Types
llm
Threat Tags
inference_time
Applications
text provenance, llm output authentication, content attribution