
More Haste, Less Speed: Weaker Single-Layer Watermark Improves Distortion-Free Watermark Ensembles

Ruibo Chen 1, Yihan Wu 1, Xuehao Cui 1, Jingqi Zhang 2, Heng Huang 1

0 citations · 22 references · arXiv (Cornell University)


Published on arXiv

arXiv:2602.11793

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Weaker single-layer watermarks consistently outperform strong baselines in both detectability and robustness by mitigating entropy decay across ensemble layers

Weaker Distortion-Free Watermark Ensemble

Novel technique introduced


Watermarking has emerged as a crucial technique for detecting and attributing content generated by large language models. While recent advancements have utilized watermark ensembles to enhance robustness, prevailing methods typically prioritize maximizing the strength of the watermark at every individual layer. In this work, we identify a critical limitation in this "stronger-is-better" approach: strong watermarks significantly reduce the entropy of the token distribution, which paradoxically weakens the effectiveness of watermarking in subsequent layers. We theoretically and empirically show that detectability is bounded by entropy and that watermark ensembles induce a monotonic decrease in both entropy and the expected green-list ratio across layers. To address this inherent trade-off, we propose a general framework that utilizes weaker single-layer watermarks to preserve the entropy required for effective multi-layer ensembling. Empirical evaluations demonstrate that this counter-intuitive strategy mitigates signal decay and consistently outperforms strong baselines in both detectability and robustness.
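The entropy-reduction effect the abstract describes can be illustrated with a minimal KGW-style soft-watermark sketch. The vocabulary, logits, green list, and bias value δ below are hypothetical placeholders, not the paper's settings: biasing green-list logits concentrates probability mass and shrinks the entropy available to any subsequent watermark layer.

```python
import math

def softmax(logits):
    # Numerically stable softmax over raw logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(p):
    # Shannon entropy (nats) of a probability vector
    return -sum(x * math.log(x) for x in p if x > 0)

def green_bias(logits, green, delta):
    # Soft watermark: add bias delta to green-list token logits
    return [l + delta if i in green else l for i, l in enumerate(logits)]

logits = [1.0, 0.5, 0.2, -0.3, -1.0]   # toy 5-token vocabulary
green = {0, 2}                          # hypothetical green list
p_plain = softmax(logits)
p_marked = softmax(green_bias(logits, green, delta=4.0))

# A strong bias concentrates mass on green tokens, reducing entropy
assert entropy(p_marked) < entropy(p_plain)
```

In an ensemble, each layer repeats this biasing on the already-sharpened distribution, which is the monotonic entropy decay the paper analyzes.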


Key Contributions

  • Theoretical proof that detectability in distortion-free watermarks is bounded by token distribution entropy, and that watermark ensembles cause monotonic entropy decay across layers
  • Identification of the "stronger-is-better" fallacy in watermark ensembles: stronger single-layer watermarks reduce entropy and paradoxically degrade multi-layer ensemble performance
  • A general framework using a mixing coefficient λ to weaken single-layer watermarks, preserving entropy and improving overall ensemble detectability and robustness
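As a rough illustration of the mixing-coefficient idea (the distributions and λ value below are invented for the example; the paper's actual weakening scheme may differ), interpolating between the original model distribution and a strongly watermarked one trades per-layer watermark strength for retained entropy:

```python
import math

def entropy(p):
    # Shannon entropy (nats) of a probability vector
    return -sum(x * math.log(x) for x in p if x > 0)

def mix(p_orig, p_wm, lam):
    # Hypothetical weakening via a mixing coefficient lam in [0, 1]:
    # lam = 1 recovers the full-strength watermark distribution,
    # lam = 0 leaves the original model distribution untouched.
    return [(1 - lam) * a + lam * b for a, b in zip(p_orig, p_wm)]

p_orig = [0.40, 0.25, 0.20, 0.10, 0.05]       # toy model distribution
p_wm   = [0.68, 0.01, 0.30, 0.005, 0.005]     # strongly watermarked distribution
p_weak = mix(p_orig, p_wm, lam=0.3)

# The weakened layer preserves more entropy for later ensemble layers
assert entropy(p_wm) < entropy(p_weak) < entropy(p_orig)
```

Tuning λ per layer lets an ensemble embed several detectable signals before the token distribution collapses, which is the core of the proposed framework.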

🛡️ Threat Analysis

Output Integrity Attack

Directly addresses watermarking of LLM-generated text outputs for provenance tracking and AI-generated content detection — a core output integrity concern. The framework embeds watermark signals in generated token distributions to enable post-hoc detection of machine-generated text.
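Post-hoc detection of green-list watermarks is typically a z-test on the observed fraction of green tokens against the unwatermarked expectation. A minimal sketch, with illustrative counts (not from the paper):

```python
import math

def z_score(green_count, total, gamma=0.5):
    # Under H0 (unwatermarked text) each token lands in the green list
    # with probability gamma; a large z indicates a watermark.
    expected = gamma * total
    std = math.sqrt(total * gamma * (1 - gamma))
    return (green_count - expected) / std

# Example: 140 of 200 tokens are green -> strong watermark evidence
z = z_score(140, 200)
assert z > 4.0
```

Entropy decay across ensemble layers lowers the achievable green-token excess, which is why the paper's detectability bound depends on the entropy each layer leaves behind.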


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time
Applications
ai-generated text detection, llm output attribution, content provenance