Defense · 2026

Towards Anytime-Valid Statistical Watermarking

Baihe Huang , Eric Xu , Kannan Ramchandran , Jiantao Jiao , Michael I. Jordan

arXiv (Cornell University)


Published on arXiv

arXiv: 2602.17608

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Reduces average token budget required for watermark detection by 13–15% relative to state-of-the-art baselines while preserving anytime-valid Type-I error guarantees.

Anchored E-Watermarking

Novel technique introduced


The proliferation of Large Language Models (LLMs) necessitates efficient mechanisms to distinguish machine-generated content from human text. While statistical watermarking has emerged as a promising solution, existing methods suffer from two critical limitations: the lack of a principled approach for selecting sampling distributions, and the reliance on fixed-horizon hypothesis testing, which precludes valid early stopping. In this paper, we bridge this gap by developing the first e-value-based watermarking framework, Anchored E-Watermarking, which unifies optimal sampling with anytime-valid inference. Unlike traditional approaches, where optional stopping invalidates Type-I error guarantees, our framework enables valid anytime inference by constructing a test supermartingale for the detection process. By leveraging an anchor distribution to approximate the target model, we characterize the optimal e-value with respect to the worst-case log-growth rate and derive the optimal expected stopping time. Our theoretical claims are substantiated by simulations and evaluations on established benchmarks, showing that our framework significantly enhances sample efficiency, reducing the average token budget required for detection by 13–15% relative to state-of-the-art baselines.
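The detection procedure the abstract describes can be sketched as a sequential e-value test. The sketch below is illustrative (function and variable names are our own, not taken from the paper): per-token e-values are multiplied into a running "wealth" process, which is a test supermartingale under the null hypothesis of human text; by Ville's inequality, stopping the first time the product reaches 1/α keeps the Type-I error at most α regardless of when detection stops.

```python
def evalue_sequential_test(evalues, alpha=0.05):
    """Anytime-valid sequential test from per-token e-values (illustrative sketch).

    Each e-value must satisfy E[e_t | past] <= 1 under the null, so the
    running product is a test supermartingale. By Ville's inequality,
    P(product ever reaches 1/alpha) <= alpha under the null, which makes
    early stopping valid at any token.
    """
    wealth = 1.0
    for t, e in enumerate(evalues, start=1):
        wealth *= e
        if wealth >= 1.0 / alpha:
            return t  # watermark detected after t tokens
    return None  # evidence never crossed the threshold
```

Under the alternative, e-values with positive expected log-growth drive the wealth upward geometrically, so detection typically occurs after a short prefix; the paper's contribution is choosing e-values that maximize this worst-case growth rate.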


Key Contributions

  • First e-value-based watermarking framework ('Anchored E-Watermarking') that enables anytime-valid inference via a test supermartingale, allowing valid early stopping without inflating Type-I error
  • Principled characterization of the optimal e-value with respect to worst-case log-growth rate using an anchor distribution to approximate the target LLM
  • Derivation of optimal expected stopping time and empirical validation showing 13–15% reduction in average token budget for detection versus state-of-the-art baselines
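As a concrete, hypothetical instance of a valid per-token e-value, consider a green-list-style watermark: under the null, a token lands in the green list with probability γ, while watermarked generation boosts this probability to δ > γ. The likelihood ratio between these two Bernoulli models is then a bona fide e-value, since its expectation under the null is exactly 1. This illustrates the e-value idea only; it is not the paper's anchored construction, which instead optimizes the worst-case log-growth rate via an anchor distribution approximating the target LLM.

```python
def green_list_evalue(is_green, gamma=0.5, delta=0.7):
    """Likelihood-ratio e-value for a green-list watermark (illustrative only).

    Null: token is green with prob gamma. Alternative: green with prob delta.
    Null expectation is gamma*(delta/gamma) + (1-gamma)*((1-delta)/(1-gamma))
    = delta + (1 - delta) = 1, so this is a valid e-value.
    """
    return delta / gamma if is_green else (1 - delta) / (1 - gamma)
```

Feeding these per-token e-values into a running product yields a sequential test whose Type-I error stays controlled under optional stopping; the quality of the watermark (the gap δ − γ) governs how fast evidence accumulates.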

🛡️ Threat Analysis

Output Integrity Attack

Embeds watermarks in LLM text outputs to verify content provenance and distinguish machine-generated text from human text. This is output integrity / content authentication, not model IP protection (ML05): the watermark lives in the generated tokens, not in the model weights.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time
Datasets
established NLP benchmarks (unspecified in abstract/body excerpt)
Applications
llm-generated text detection, ai content provenance, machine-generated text attribution