Defense · 2025

An Ensemble Framework for Unbiased Language Model Watermarking

Yihan Wu, Ruibo Chen, Georgios Milis, Heng Huang

3 citations · 27 references · arXiv


Published on arXiv · 2509.24043

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

ENS substantially reduces the number of tokens needed for reliable watermark detection and increases resistance to smoothing and paraphrasing attacks without degrading generation quality across multiple LLM families.

ENS

Novel technique introduced


As large language models become increasingly capable and widely deployed, verifying the provenance of machine-generated content is critical to ensuring trust, safety, and accountability. Watermarking techniques have emerged as a promising solution by embedding imperceptible statistical signals into the generation process. Among them, unbiased watermarking is particularly attractive due to its theoretical guarantee of preserving the language model's output distribution, thereby avoiding degradation in fluency or detectability through distributional shifts. However, existing unbiased watermarking schemes often suffer from weak detection power and limited robustness, especially under short text lengths or distributional perturbations. In this work, we propose ENS, a novel ensemble framework that enhances the detectability and robustness of logits-based unbiased watermarks while strictly preserving their unbiasedness. ENS sequentially composes multiple independent watermark instances, each governed by a distinct key, to amplify the watermark signal. We theoretically prove that the ensemble construction remains unbiased in expectation and demonstrate how it improves the signal-to-noise ratio for statistical detectors. Empirical evaluations on multiple LLM families show that ENS substantially reduces the number of tokens needed for reliable detection and increases resistance to smoothing and paraphrasing attacks without compromising generation quality.
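To make the "sequential composition of independent watermark instances" concrete, here is a minimal sketch of one plausible construction. It assumes a Gumbel-max base watermark (a standard distribution-preserving sampler) and a simple key-cycling schedule across token positions; the function names, the PRF construction, and the scheduling choice are illustrative assumptions, not the paper's exact algorithm.

```python
import hashlib
import math
import random

def gumbel_noise(key: int, context: tuple, vocab_size: int) -> list:
    # Pseudorandom Gumbel(0, 1) noise derived from (key, context).
    # The same key and context always reproduce the same noise, which is
    # what a detector re-derives later to score tokens.
    seed = hashlib.sha256(f"{key}:{context}".encode()).digest()
    rng = random.Random(seed)
    return [-math.log(-math.log(rng.random())) for _ in range(vocab_size)]

def ens_sample(logits: list, context: tuple, keys: list, step: int) -> int:
    # ENS-style sequential composition (sketch): cycle through independent
    # keys across positions, so each position is watermarked by one layer.
    # Gumbel-max sampling (argmax of logits + Gumbel noise) draws exactly
    # from softmax(logits), so each layer is unbiased in expectation.
    key = keys[step % len(keys)]
    g = gumbel_noise(key, context, len(logits))
    return max(range(len(logits)), key=lambda v: logits[v] + g[v])
```

Because every layer on its own preserves the sampling distribution, any schedule that picks one layer per position also preserves it, which matches the framework's unbiasedness claim.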


Key Contributions

  • ENS ensemble framework that sequentially composes multiple independent unbiased watermark instances with distinct keys to amplify detection signal while preserving output distribution
  • Theoretical proof that the ensemble construction remains unbiased in expectation and improves signal-to-noise ratio by approximately √n with n watermark layers
  • Empirical demonstration that ENS substantially reduces tokens needed for reliable detection and improves robustness to smoothing and paraphrasing attacks across multiple LLM families
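The √n signal-to-noise claim in the second bullet can be illustrated with a standard aggregation argument (my framing, not the paper's derivation): if each of n independent keys yields its own detector z-score, Stouffer-style combination scales the joint score by √n.

```python
import math

def combined_z(per_key_z: list) -> float:
    # Stouffer aggregation of n independent per-key detector z-scores:
    # Z = (sum of z_i) / sqrt(n). If each layer alone yields z ≈ mu,
    # the ensemble yields Z ≈ sqrt(n) * mu — the sqrt(n) SNR gain.
    n = len(per_key_z)
    return sum(per_key_z) / math.sqrt(n)

# 16 layers, each barely detectable alone (z = 1.0): combined z = 4.0.
print(combined_z([1.0] * 16))  # → 4.0
```

This is also why fewer tokens suffice for detection: a fixed target z-score is reached with roughly 1/n as many tokens when n layers contribute independent evidence.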

🛡️ Threat Analysis

Output Integrity Attack

Embeds statistical watermark signals into LLM-generated text to verify provenance and detect AI-generated content, i.e., output integrity and content authenticity. The watermark lives in the generated text, not in the model weights, which places this under ML09 (Output Integrity Attack) rather than ML05 (Model Theft).
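The verification side of this threat model is a statistical test over the text alone. The sketch below is a hypothetical detector matching the Gumbel-max, key-cycling assumptions above: it re-derives each position's keyed noise and checks whether observed tokens systematically land on high-noise entries (unwatermarked text averages the Gumbel mean, Euler's gamma ≈ 0.577).

```python
import hashlib
import math
import random

def keyed_gumbel(key: int, context: tuple, vocab_size: int) -> list:
    # Re-derive the same pseudorandom Gumbel noise the generator used.
    seed = hashlib.sha256(f"{key}:{context}".encode()).digest()
    rng = random.Random(seed)
    return [-math.log(-math.log(rng.random())) for _ in range(vocab_size)]

def detect_score(tokens: list, keys: list, vocab_size: int,
                 ctx_window: int = 4) -> float:
    # Mean Gumbel score of observed tokens under the cycled keys.
    # Watermarked text is biased toward high-noise tokens, so its mean
    # score exceeds the unwatermarked baseline of ~0.577 (Euler gamma).
    scores = []
    for t, tok in enumerate(tokens):
        key = keys[t % len(keys)]
        context = tuple(tokens[max(0, t - ctx_window):t])
        g = keyed_gumbel(key, context, vocab_size)
        scores.append(g[tok])
    return sum(scores) / len(scores)
```

An attacker mounting an output integrity attack (paraphrasing, smoothing) tries to push this score back toward the baseline without destroying the text's utility; the paper's claim is that ensembling raises how much perturbation that takes.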


Details

Domains: nlp
Model Types: llm, transformer
Threat Tags: inference_time
Applications: llm-generated text provenance verification; ai content attribution