benchmark 2025

Analyzing and Evaluating Unbiased Language Model Watermark

Yihan Wu, Xuehao Cui, Ruibo Chen, Heng Huang

3 citations · 29 references · arXiv

Published on arXiv · 2509.24048

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Token modification attacks provide significantly more stable and consistent robustness assessments of unbiased watermarks than paraphrasing-based attacks, which suffer from high variance and misleading results.

UWbench

Novel technique introduced


Verifying the authenticity of AI-generated text has become increasingly important with the rapid advancement of large language models, and unbiased watermarking has emerged as a promising approach due to its ability to preserve the output distribution without degrading quality. However, recent work reveals that unbiased watermarks can accumulate distributional bias over multiple generations and that existing robustness evaluations are inconsistent across studies. To address these issues, we introduce UWbench, the first open-source benchmark dedicated to the principled evaluation of unbiased watermarking methods. Our framework combines theoretical and empirical contributions: we propose a statistical metric to quantify multi-batch distribution drift, prove an impossibility result showing that no unbiased watermark can perfectly preserve the distribution under infinite queries, and develop a formal analysis of robustness against token-level modification attacks. Complementing this theory, we establish a three-axis evaluation protocol (unbiasedness, detectability, and robustness) and show that token modification attacks provide more stable robustness assessments than paraphrasing-based methods. Together, UWbench offers the community a standardized and reproducible platform for advancing the design and evaluation of unbiased watermarking algorithms.
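The "unbiased" property described above (per-token sampling that exactly follows the model's distribution in expectation over secret keys) can be illustrated with a Gumbel-max sketch. This is not the paper's actual scheme: the function name, the single-previous-token hash context, and the key handling below are all illustrative assumptions.

```python
import hashlib
import math
import random

def gumbel_watermark_sample(logits, prev_token, key="secret-key"):
    """Sample the next token via the Gumbel-max trick, with noise seeded
    by a keyed hash of the previous token. Marginally over random keys,
    the chosen token follows softmax(logits) exactly -- the unbiasedness
    property. (Hypothetical sketch, not the benchmark's implementation.)"""
    # Keyed pseudorandomness: hash (key, prev_token) to seed per-step noise.
    seed = int.from_bytes(
        hashlib.sha256(f"{key}|{prev_token}".encode()).digest()[:8], "big"
    )
    rng = random.Random(seed)
    # Convert logits to probabilities (numerically stable softmax).
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Gumbel-max: argmax_i log(p_i) + g_i, with g_i keyed Gumbel noise.
    gumbels = [
        -math.log(-math.log(max(rng.random(), 1e-12))) for _ in probs
    ]
    scores = [math.log(p) + g for p, g in zip(probs, gumbels)]
    return max(range(len(scores)), key=scores.__getitem__)
```

Because the noise is a deterministic function of the key and context, the same context always yields the same choice, which is exactly the repeated-query setting where the paper's impossibility result shows distributional drift becomes unavoidable.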


Key Contributions

  • UWbench: the first open-source benchmark for principled evaluation of unbiased LLM watermarking methods with reproducible tooling
  • Impossibility result proving no unbiased watermark can perfectly preserve output distribution under infinite queries, plus a multi-batch distribution drift metric
  • Three-axis evaluation protocol (unbiasedness, detectability, robustness) showing token modification attacks yield more stable robustness assessments than paraphrasing-based evaluations
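The token-modification robustness axis can be sketched as independent random substitutions at a fixed rate, after which the detector is re-run on the corrupted text. The function and parameter names below are hypothetical illustrations, not the benchmark's API.

```python
import random

def token_substitution_attack(tokens, epsilon, vocab_size, seed=0):
    """Replace each token independently with probability `epsilon` by a
    uniformly random vocabulary token -- a token-level modification
    attack of the kind used to stress-test watermark robustness.
    (Illustrative sketch with assumed names.)"""
    rng = random.Random(seed)
    out = []
    for t in tokens:
        if rng.random() < epsilon:
            out.append(rng.randrange(vocab_size))  # corrupted position
        else:
            out.append(t)  # position left intact
    return out
```

Sweeping `epsilon` and measuring detection rate gives a robustness curve with a single controlled knob, which is why such attacks yield lower-variance assessments than paraphrasing, where the amount of change per run is uncontrolled.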

🛡️ Threat Analysis

Output Integrity Attack

Watermarking LLM-generated text to verify AI content authenticity and trace provenance is a core ML09 concern; the paper also evaluates robustness against token-level watermark-removal attacks, which directly target content integrity.
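A removal attack succeeds when it drives the detector's score back toward the unwatermarked baseline. A minimal sketch of a Gumbel-style per-token detection score, assuming a keyed hash of the previous token as the pseudorandomness context (all names and design details here are illustrative assumptions, not the paper's detector):

```python
import hashlib
import math
import random

def detect_score(tokens, key="secret-key"):
    """Average per-token detection score for a Gumbel-style watermark:
    re-derive the keyed uniform draw u assigned to each observed token
    and accumulate -log(1 - u). Watermarked sampling favors tokens with
    large u, so the average exceeds the ~1.0 expected on clean text.
    (Hypothetical sketch.)"""
    score = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        seed = int.from_bytes(
            hashlib.sha256(f"{key}|{prev}".encode()).digest()[:8], "big"
        )
        rng = random.Random(seed)
        # Re-generate the uniform draw associated with token index `cur`.
        u = [rng.random() for _ in range(cur + 1)][cur]
        score += -math.log(max(1.0 - u, 1e-12))
    return score / max(len(tokens) - 1, 1)
```

An attacker who substitutes tokens breaks the (prev, cur) pairing at each edited position, lowering the average score; detection then becomes a threshold test on this statistic.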


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, black_box
Applications
llm text watermarking, ai-generated text detection, content provenance verification