Defense · 2026

A Unified Framework for LLM Watermarks

Thibaud Gloaguen, Robin Staab, Nikola Jovanović, Martin Vechev

0 citations · 26 references · arXiv (Cornell University)


Published on arXiv

2602.06754

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Watermarking schemes derived from a given constraint consistently maximize detection power with respect to that constraint, validating the proposed unified framework.

Constrained Optimization Framework for LLM Watermarks

Novel technique introduced


LLM watermarks enable tracing AI-generated text by embedding a detectable signal into the model's output. Recent works have proposed a wide range of watermarking algorithms, each with a distinct design, usually built in a bottom-up fashion. Crucially, there is no general and principled formulation of LLM watermarking. In this work, we show that most existing and widely used watermarking schemes can in fact be derived from a principled constrained optimization problem. Our formulation unifies existing watermarking methods and makes explicit the constraint that each method optimizes. In particular, it highlights an understudied quality-diversity-power trade-off. At the same time, our framework provides a principled approach for designing novel watermarking schemes tailored to specific requirements. For instance, it allows us to use perplexity directly as a proxy for quality and to derive new schemes that are optimal with respect to this constraint. Our experimental evaluation validates the framework: watermarking schemes derived from a given constraint consistently maximize detection power with respect to that constraint.
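To make the watermarking setting concrete, below is a minimal sketch of one widely used scheme family the paper's framework is meant to cover: the red/green-list watermark, which biases sampling toward a pseudorandom "green" subset of the vocabulary seeded by the previous token. This is an illustrative toy (tiny vocabulary, plain Python, hypothetical function names), not the paper's derived schemes.

```python
import math
import random

def green_list(prev_token: int, vocab_size: int, gamma: float = 0.5) -> set:
    """Pseudorandomly partition the vocabulary, seeded by the previous token.

    gamma is the fraction of tokens placed on the green list.
    """
    rng = random.Random(prev_token)
    return set(rng.sample(range(vocab_size), int(gamma * vocab_size)))

def watermarked_sample(logits: list, prev_token: int, delta: float = 2.0) -> int:
    """Add a bias delta to green-list logits, then sample from the softmax.

    Larger delta -> stronger watermark signal (power) but larger distortion
    of the model's distribution (quality/diversity), the trade-off the
    paper's framework makes explicit.
    """
    vocab_size = len(logits)
    green = green_list(prev_token, vocab_size)
    biased = [l + (delta if i in green else 0.0) for i, l in enumerate(logits)]
    # Numerically stable softmax over the biased logits.
    m = max(biased)
    weights = [math.exp(l - m) for l in biased]
    rng = random.Random(0)  # fixed seed only to keep this sketch reproducible
    return rng.choices(range(vocab_size), weights=weights, k=1)[0]
```

With a large delta, sampled tokens fall on the green list almost surely; with delta = 0 the scheme reduces to ordinary sampling, i.e., no watermark and no detection power.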


Key Contributions

  • Shows that most existing LLM watermarking schemes can be derived from a single principled constrained optimization problem, unifying disparate prior designs
  • Explicitly characterizes the quality-diversity-power trade-off inherent to watermarking and reveals which constraints each existing method optimizes
  • Derives novel watermarking schemes optimal with respect to specific constraints (e.g., perplexity as a quality proxy) and validates them experimentally
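The "detection power" the contributions refer to can be illustrated with the standard counting detector for green-list-style watermarks: count how many tokens land on their green list and compute a one-sided z-score against the null rate gamma expected of unwatermarked text. Again a hedged toy sketch, not the paper's detector; `green_list` is the same hypothetical helper as above.

```python
import math
import random

def green_list(prev_token: int, vocab_size: int, gamma: float = 0.5) -> set:
    """Pseudorandom vocabulary partition, seeded by the previous token."""
    rng = random.Random(prev_token)
    return set(rng.sample(range(vocab_size), int(gamma * vocab_size)))

def detect(tokens: list, vocab_size: int, gamma: float = 0.5) -> float:
    """One-sided z-score for the green-token count.

    Under the null (unwatermarked text), each token is green with
    probability gamma, so hits ~ Binomial(n, gamma); a large z-score
    indicates the watermark is present.
    """
    hits = sum(
        1
        for prev, cur in zip(tokens, tokens[1:])
        if cur in green_list(prev, vocab_size, gamma)
    )
    n = len(tokens) - 1
    return (hits - gamma * n) / math.sqrt(gamma * (1.0 - gamma) * n)
```

A fully green sequence of n + 1 tokens yields z = sqrt(n) at gamma = 0.5, so even short watermarked texts clear a typical detection threshold of z > 4.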

🛡️ Threat Analysis

Output Integrity Attack

LLM watermarking embeds detectable signals in model-generated text to trace provenance and authenticate AI-generated content — a direct instance of output integrity and content provenance protection. The paper proposes a unified framework for designing such watermarks.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time
Applications
ai-generated text detection, text provenance, content attribution