ML Security Papers

defense 2026

Refined Detection for Gumbel Watermarking

Google DeepMind

0 citations

Published on arXiv

2603.30017

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Achieves near-optimal detection efficiency, with bounds expressed in terms of the distribution of next-token entropies, improving upon the Ω(log(1/δ)/H̄²) detection time of the original Gumbel scheme.
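To give a feel for the log(1/δ)/H̄² scaling quoted for the original scheme, the following minimal sketch evaluates that expression with all constant factors dropped. It assumes δ is the target error probability and H̄ is an average next-token entropy in nats; neither reading is spelled out in this summary, and the function name is hypothetical.

```python
import math


def baseline_token_scale(delta: float, mean_entropy: float) -> float:
    """Evaluate log(1/delta) / mean_entropy**2, ignoring constant factors.

    Assumptions (not stated in the summary): delta is an error probability
    in (0, 1) and mean_entropy is an average entropy measured in nats.
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    if mean_entropy <= 0.0:
        raise ValueError("mean_entropy must be positive")
    return math.log(1.0 / delta) / mean_entropy ** 2


# Halving the entropy quadruples the scale, which is why low-entropy text
# is the hard case for the original detector.
for h in (1.0, 0.5, 0.25):
    print(h, round(baseline_token_scale(1e-6, h), 1))
```

The quadratic dependence on H̄ is the point of the illustration: the absolute token counts are not meaningful without the constants from the analysis.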

Refined Gumbel Watermark Detection

Novel technique introduced

We propose a simple detection mechanism for the Gumbel watermarking scheme proposed by Aaronson (2022). The new mechanism is proven to be near-optimal in a problem-dependent sense among all model-agnostic watermarking schemes under the assumption that the next-token distribution is sampled i.i.d.
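For orientation, here is a minimal sketch of the baseline Gumbel scheme that the refined detector builds on. It is not the paper's refined mechanism, which this summary does not specify. It shows Gumbel-max sampling keyed on preceding tokens plus the classic sum-of-scores test. The window size, vocabulary size, hashing scheme, toy random model, and all function names are assumptions made for illustration.

```python
import hashlib
import math
import random

VOCAB_SIZE = 16  # toy vocabulary (assumption)
WINDOW = 4       # preceding tokens hashed into the seed (assumption)


def uniforms(key, context):
    """One pseudo-random uniform in [0, 1) per token, derived from key and context."""
    digest = hashlib.sha256(f"{key}|{list(context)}".encode()).digest()
    rng = random.Random(int.from_bytes(digest, "big"))
    return [rng.random() for _ in range(VOCAB_SIZE)]


def watermarked_token(probs, r):
    """Gumbel-max choice: argmax_i r_i^(1/p_i), computed as argmax_i log(r_i)/p_i."""
    best, best_val = 0, -math.inf
    for i, (p, u) in enumerate(zip(probs, r)):
        if p <= 0.0:
            continue  # tokens with zero probability are never selected
        val = math.log(max(u, 1e-300)) / p
        if val > best_val:
            best, best_val = i, val
    return best


def random_distribution(rng):
    """Toy stand-in for a language model: a fresh random distribution per step."""
    weights = [rng.expovariate(1.0) for _ in range(VOCAB_SIZE)]
    total = sum(weights)
    return [w / total for w in weights]


def generate(key, n, rng, watermark=True):
    """Return WINDOW fixed prompt tokens followed by n generated tokens."""
    tokens = [0] * WINDOW
    for _ in range(n):
        probs = random_distribution(rng)
        if watermark:
            tok = watermarked_token(probs, uniforms(key, tokens[-WINDOW:]))
        else:
            tok = rng.choices(range(VOCAB_SIZE), weights=probs)[0]
        tokens.append(tok)
    return tokens


def detection_score(key, tokens):
    """Model-agnostic statistic: sum of -log(1 - r) over the observed tokens."""
    score = 0.0
    for t in range(WINDOW, len(tokens)):
        r = uniforms(key, tokens[t - WINDOW:t])
        score += -math.log(1.0 - r[tokens[t]])
    return score


def p_value(score, n):
    """Null tail P(Gamma(n, 1) >= score), computed as P(Poisson(score) <= n - 1)."""
    term = math.exp(-score)
    total = term
    for k in range(1, n):
        term *= score / k
        total += term
    return min(total, 1.0)


model_rng = random.Random(0)
samples = {
    "watermarked": generate("secret", 200, model_rng, watermark=True),
    "plain": generate("secret", 200, model_rng, watermark=False),
}
for name, toks in samples.items():
    n = len(toks) - WINDOW
    print(name, p_value(detection_score("secret", toks), n))
```

The detector needs only the key and the token sequence, not the model's probabilities, which is what "model-agnostic" refers to. Under the null hypothesis each score term is a unit exponential, so the sum follows a Gamma(n, 1) distribution and the p-value above is exact for text generated independently of the key.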

Key Contributions

Proposes a refined detection mechanism for Gumbel watermarking with near-optimal problem-dependent statistical efficiency
Proves matching upper and lower bounds on the number of tokens needed to detect watermarked text in terms of entropy-like quantities
Demonstrates the detection mechanism is near-optimal among all model-agnostic watermarking schemes under i.i.d. next-token distribution assumptions

🛡️ Threat Analysis

Output Integrity Attack

The paper addresses detection of watermarked LLM-generated text to verify content provenance. The Gumbel watermarking scheme embeds detectable signals in model outputs (not in model weights), making this an output integrity/content authentication defense.

Details

Domains

nlp

Model Types

llm

Threat Tags

inference_time

Applications

text provenance, llm output authentication, content attribution

Similar Papers

Optimizing Token Choice for Code Watermarking: An RL Approach

Output Integrity Attack

RTLMarker: Protecting LLM-Generated RTL Copyright via a Hardware Watermarking Framework

Output Integrity Attack

Adaptive Testing for Segmenting Watermarked Texts From Language Models

Output Integrity Attack

Detecting Cognitive Signatures in Typing Behavior for Non-Intrusive Authorship Verification

Output Integrity Attack

Online LLM watermark detection via e-processes

Output Integrity Attack

WaterSearch: A Quality-Aware Search-based Watermarking Framework for Large Language Models

Output Integrity Attack

MGT-Prism: Enhancing Domain Generalization for Machine-Generated Text Detection via Spectral Alignment

Output Integrity Attack

Two Birds with One Stone: Multi-Task Detection and Attribution of LLM-Generated Text

Output Integrity Attack