Published on arXiv
2603.30017
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Achieves near-optimal detection efficiency, with bounds expressed in terms of the distribution of the entropy of the next-token distributions, improving upon the Ω(log(1/δ)/H̄²) detection time of the original Gumbel scheme
Refined Gumbel Watermark Detection
Novel technique introduced
We propose a simple detection mechanism for the Gumbel watermarking scheme introduced by Aaronson (2022). The new mechanism is proven to be near-optimal in a problem-dependent sense among all model-agnostic watermarking schemes, under the assumption that the next-token distributions are sampled i.i.d.
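For orientation, the baseline Gumbel scheme that this work refines can be sketched as below. This is a minimal illustration under stated assumptions, not the paper's refined detector: it uses the commonly described baseline statistic (the sum of -log(1 - r) over the observed tokens) with a rough normal approximation. The toy "model" draws a fresh random next-token distribution at each step, mirroring the i.i.d. assumption above. All names (`pseudo_uniforms`, `toy_next_token_probs`, `generate_watermarked`, `detection_z_score`), the vocabulary size, and the context width are hypothetical choices made for this example.

```python
import hashlib
import math
import random

VOCAB_SIZE = 50      # toy vocabulary size (assumption for illustration)
CONTEXT_WIDTH = 4    # number of previous tokens hashed with the key (assumption)


def pseudo_uniforms(key, context):
    """Derive one pseudorandom uniform in (0, 1) per vocabulary token from (key, context)."""
    digest = hashlib.sha256(key + repr(context).encode()).digest()
    rng = random.Random(int.from_bytes(digest, "big"))
    # Clamp away from 0 and 1 so both logarithms used below stay finite.
    return [min(max(rng.random(), 1e-12), 1.0 - 1e-12) for _ in range(VOCAB_SIZE)]


def toy_next_token_probs(model_rng):
    """Hypothetical stand-in for an LLM: a random distribution drawn i.i.d. at each step."""
    weights = [model_rng.gammavariate(1.0, 1.0) for _ in range(VOCAB_SIZE)]
    total = sum(weights)
    return [w / total for w in weights]


def generate_watermarked(key, length, model_rng):
    """Sample tokens with the Gumbel trick: pick argmax_i r_i^(1/p_i) at each step."""
    tokens = []
    for _ in range(length):
        r = pseudo_uniforms(key, tuple(tokens[-CONTEXT_WIDTH:]))
        probs = toy_next_token_probs(model_rng)
        # r_i^(1/p_i) is maximised where log(r_i) / p_i is maximised.
        token = max(
            (i for i in range(VOCAB_SIZE) if probs[i] > 0),
            key=lambda i: math.log(r[i]) / probs[i],
        )
        tokens.append(token)
    return tokens


def detection_z_score(key, tokens):
    """Baseline detector: sum of -log(1 - r) at the observed tokens, standardised.

    Without a watermark each term behaves like an Exp(1) variable (mean 1,
    variance 1), so the standardised sum is roughly standard normal; watermarked
    text pushes it upward. The model's probabilities are never needed here.
    """
    total = 0.0
    for t, token in enumerate(tokens):
        r = pseudo_uniforms(key, tuple(tokens[max(0, t - CONTEXT_WIDTH):t]))
        total += -math.log(1.0 - r[token])
    n = len(tokens)
    return (total - n) / math.sqrt(n)
```

As a usage example, `detection_z_score(key, generate_watermarked(key, 200, random.Random(0)))` with `key = b"demo-key"` should come out far larger than the score of unrelated tokens, or of the same tokens checked against a different key. The detector only needs the key and the token sequence, which is what makes the scheme model-agnostic at detection time.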
Key Contributions
- Proposes a refined detection mechanism for Gumbel watermarking with near-optimal problem-dependent statistical efficiency
- Proves matching upper and lower bounds on the number of tokens needed to detect watermarked text in terms of entropy-like quantities
- Demonstrates that the detection mechanism is near-optimal among all model-agnostic watermarking schemes under the assumption of i.i.d. next-token distributions
🛡️ Threat Analysis
The paper addresses detection of watermarked LLM-generated text to verify content provenance. The Gumbel watermarking scheme embeds detectable signals in model outputs (not in model weights), making this an output integrity/content authentication defense.