Defense · 2026

A Unified Framework for LLM Watermarks

Thibaud Gloaguen, Robin Staab, Nikola Jovanović, Martin Vechev

0 citations · 26 references · arXiv (Cornell University)


Published on arXiv

2602.06754

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Watermarking schemes derived from a given constraint consistently maximize detection power with respect to that constraint, validating the proposed unified framework.

Constrained Optimization Framework for LLM Watermarks

Novel technique introduced


LLM watermarks enable tracing AI-generated text by embedding a detectable signal into the model's output. Recent works have proposed a wide range of watermarking algorithms, each with a distinct design, usually built in a bottom-up fashion. Crucially, there is no general and principled formulation of LLM watermarking. In this work, we show that most existing and widely used watermarking schemes can in fact be derived from a principled constrained optimization problem. Our formulation unifies existing watermarking methods and makes explicit the constraint that each method optimizes. In particular, it highlights an understudied quality-diversity-power trade-off. At the same time, our framework provides a principled approach for designing novel watermarking schemes tailored to specific requirements. For instance, it allows us to use perplexity directly as a proxy for quality and to derive new schemes that are optimal with respect to this constraint. Our experimental evaluation validates the framework: watermarking schemes derived from a given constraint consistently maximize detection power with respect to that constraint.
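To make the watermarking setting concrete, below is a minimal sketch of one widely used scheme family the paper's framework is meant to cover: the red/green-list watermark, which biases sampling toward a pseudorandom "green" subset of the vocabulary seeded by the previous token. This is an illustrative toy (tiny vocabulary, plain Python, hypothetical function names), not the paper's derived schemes.

```python
import math
import random

def green_list(prev_token: int, vocab_size: int, gamma: float = 0.5) -> set:
    """Pseudorandomly partition the vocabulary, seeded by the previous token.

    gamma is the fraction of tokens placed on the green list.
    """
    rng = random.Random(prev_token)
    return set(rng.sample(range(vocab_size), int(gamma * vocab_size)))

def watermarked_sample(logits: list, prev_token: int, delta: float = 2.0) -> int:
    """Add a bias delta to green-list logits, then sample from the softmax.

    Larger delta -> stronger watermark signal (power) but larger distortion
    of the model's distribution (quality/diversity), the trade-off the
    paper's framework makes explicit.
    """
    vocab_size = len(logits)
    green = green_list(prev_token, vocab_size)
    biased = [l + (delta if i in green else 0.0) for i, l in enumerate(logits)]
    # Numerically stable softmax over the biased logits.
    m = max(biased)
    weights = [math.exp(l - m) for l in biased]
    rng = random.Random(0)  # fixed seed only to keep this sketch reproducible
    return rng.choices(range(vocab_size), weights=weights, k=1)[0]
```

With a large delta, sampled tokens fall on the green list almost surely; with delta = 0 the scheme reduces to ordinary sampling, i.e., no watermark and no detection power.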


Key Contributions

  • Shows that most existing LLM watermarking schemes can be derived from a single principled constrained optimization problem, unifying disparate prior designs
  • Explicitly characterizes the quality-diversity-power trade-off inherent to watermarking and reveals which constraints each existing method optimizes
  • Derives novel watermarking schemes optimal with respect to specific constraints (e.g., perplexity as a quality proxy) and validates them experimentally
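The "detection power" the contributions refer to can be illustrated with the standard counting detector for green-list-style watermarks: count how many tokens land on their green list and compute a one-sided z-score against the null rate gamma expected of unwatermarked text. Again a hedged toy sketch, not the paper's detector; `green_list` is the same hypothetical helper as above.

```python
import math
import random

def green_list(prev_token: int, vocab_size: int, gamma: float = 0.5) -> set:
    """Pseudorandom vocabulary partition, seeded by the previous token."""
    rng = random.Random(prev_token)
    return set(rng.sample(range(vocab_size), int(gamma * vocab_size)))

def detect(tokens: list, vocab_size: int, gamma: float = 0.5) -> float:
    """One-sided z-score for the green-token count.

    Under the null (unwatermarked text), each token is green with
    probability gamma, so hits ~ Binomial(n, gamma); a large z-score
    indicates the watermark is present.
    """
    hits = sum(
        1
        for prev, cur in zip(tokens, tokens[1:])
        if cur in green_list(prev, vocab_size, gamma)
    )
    n = len(tokens) - 1
    return (hits - gamma * n) / math.sqrt(gamma * (1.0 - gamma) * n)
```

A fully green sequence of n + 1 tokens yields z = sqrt(n) at gamma = 0.5, so even short watermarked texts clear a typical detection threshold of z > 4.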

🛡️ Threat Analysis

Output Integrity Attack

LLM watermarking embeds detectable signals in model-generated text to trace provenance and authenticate AI-generated content — a direct instance of output integrity and content provenance protection. The paper proposes a unified framework for designing such watermarks.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time
Applications
ai-generated text detection, text provenance, content attribution