A Unified Framework for LLM Watermarks
Thibaud Gloaguen, Robin Staab, Nikola Jovanović, Martin Vechev
Published on arXiv (2602.06754)
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Watermarking schemes derived from a given constraint consistently maximize detection power with respect to that constraint, validating the proposed unified framework.
Constrained Optimization Framework for LLM Watermarks
Novel technique introduced
LLM watermarks enable tracing AI-generated text by embedding a detectable signal into the model's output. Recent works have proposed a wide range of watermarking algorithms, each with a distinct design, usually built in a bottom-up fashion. Crucially, there has been no general and principled formulation of LLM watermarking. In this work, we show that most existing and widely used watermarking schemes can in fact be derived from a principled constrained optimization problem. Our formulation unifies existing watermarking methods and makes explicit the constraints that each method optimizes. In particular, it highlights an understudied quality-diversity-power trade-off. At the same time, our framework provides a principled approach to designing novel watermarking schemes tailored to specific requirements. For instance, it allows us to use perplexity directly as a proxy for quality and derive new schemes that are optimal with respect to this constraint. Our experimental evaluation validates the framework: watermarking schemes derived from a given constraint consistently maximize detection power with respect to that constraint.
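To make the abstract concrete, the sketch below shows the classic "green-list" style of watermark that this framework aims to unify: a pseudo-random subset of the vocabulary is favored at each step (keyed by the previous token), and detection counts how often generated tokens land in that subset. This is a minimal toy illustration of one well-known scheme, not the paper's framework; the vocabulary, constants, and function names are all hypothetical.

```python
import hashlib
import math
import random

VOCAB = [f"tok{i}" for i in range(1000)]  # toy vocabulary (hypothetical)
GAMMA = 0.5   # fraction of the vocabulary placed on the "green" list
DELTA = 2.0   # logit bias a generator would add to green tokens

def green_list(prev_token: str) -> set:
    """Seed a PRNG with the previous token to pick a pseudo-random green list."""
    seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)
    return set(rng.sample(VOCAB, int(GAMMA * len(VOCAB))))

def z_score(tokens: list) -> float:
    """Detection: count green transitions and compare to the GAMMA baseline.

    Under the null (no watermark), each transition is green with probability
    GAMMA, so the count is approximately N(GAMMA*n, GAMMA*(1-GAMMA)*n).
    """
    hits = sum(
        1 for prev, cur in zip(tokens, tokens[1:])
        if cur in green_list(prev)
    )
    n = len(tokens) - 1
    return (hits - GAMMA * n) / math.sqrt(GAMMA * (1 - GAMMA) * n)
```

A watermarked generator that consistently prefers green tokens produces text with a large positive z-score, while unwatermarked text scores near zero; the paper's point is that the choice of how strongly to prefer green tokens (here, `DELTA`) implicitly trades off quality and diversity against this detection power.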
Key Contributions
- Shows that most existing LLM watermarking schemes can be derived from a single principled constrained optimization problem, unifying disparate prior designs
- Explicitly characterizes the quality-diversity-power trade-off inherent to watermarking and reveals which constraints each existing method optimizes
- Derives novel watermarking schemes optimal with respect to specific constraints (e.g., perplexity as a quality proxy) and validates them experimentally
🛡️ Threat Analysis
LLM watermarking embeds detectable signals in model-generated text to trace provenance and authenticate AI-generated content — a direct instance of output integrity and content provenance protection. The paper proposes a unified framework for designing such watermarks.