Defense · 2025

An Ensemble Framework for Unbiased Language Model Watermarking

Yihan Wu, Ruibo Chen, Georgios Milis, Heng Huang

3 citations · 27 references · arXiv


Published on arXiv · 2509.24043

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

ENS substantially reduces the number of tokens needed for reliable watermark detection and increases resistance to smoothing and paraphrasing attacks without degrading generation quality across multiple LLM families.

ENS

Novel technique introduced


As large language models become increasingly capable and widely deployed, verifying the provenance of machine-generated content is critical to ensuring trust, safety, and accountability. Watermarking techniques have emerged as a promising solution by embedding imperceptible statistical signals into the generation process. Among them, unbiased watermarking is particularly attractive due to its theoretical guarantee of preserving the language model's output distribution, thereby avoiding degradation in fluency or detectability through distributional shifts. However, existing unbiased watermarking schemes often suffer from weak detection power and limited robustness, especially under short text lengths or distributional perturbations. In this work, we propose ENS, a novel ensemble framework that enhances the detectability and robustness of logits-based unbiased watermarks while strictly preserving their unbiasedness. ENS sequentially composes multiple independent watermark instances, each governed by a distinct key, to amplify the watermark signal. We theoretically prove that the ensemble construction remains unbiased in expectation and demonstrate how it improves the signal-to-noise ratio for statistical detectors. Empirical evaluations on multiple LLM families show that ENS substantially reduces the number of tokens needed for reliable detection and increases resistance to smoothing and paraphrasing attacks without compromising generation quality.
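To make the "sequential composition of independent watermark instances" concrete, here is a minimal sketch of one plausible construction. It assumes a Gumbel-max base watermark (a standard distribution-preserving sampler) and a simple key-cycling schedule across token positions; the function names, the PRF construction, and the scheduling choice are illustrative assumptions, not the paper's exact algorithm.

```python
import hashlib
import math
import random

def gumbel_noise(key: int, context: tuple, vocab_size: int) -> list:
    # Pseudorandom Gumbel(0, 1) noise derived from (key, context).
    # The same key and context always reproduce the same noise, which is
    # what a detector re-derives later to score tokens.
    seed = hashlib.sha256(f"{key}:{context}".encode()).digest()
    rng = random.Random(seed)
    return [-math.log(-math.log(rng.random())) for _ in range(vocab_size)]

def ens_sample(logits: list, context: tuple, keys: list, step: int) -> int:
    # ENS-style sequential composition (sketch): cycle through independent
    # keys across positions, so each position is watermarked by one layer.
    # Gumbel-max sampling (argmax of logits + Gumbel noise) draws exactly
    # from softmax(logits), so each layer is unbiased in expectation.
    key = keys[step % len(keys)]
    g = gumbel_noise(key, context, len(logits))
    return max(range(len(logits)), key=lambda v: logits[v] + g[v])
```

Because every layer on its own preserves the sampling distribution, any schedule that picks one layer per position also preserves it, which matches the framework's unbiasedness claim.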


Key Contributions

  • ENS ensemble framework that sequentially composes multiple independent unbiased watermark instances with distinct keys to amplify detection signal while preserving output distribution
  • Theoretical proof that the ensemble construction remains unbiased in expectation and improves signal-to-noise ratio by approximately √n with n watermark layers
  • Empirical demonstration that ENS substantially reduces tokens needed for reliable detection and improves robustness to smoothing and paraphrasing attacks across multiple LLM families
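The √n signal-to-noise claim in the second bullet can be illustrated with a standard aggregation argument (my framing, not the paper's derivation): if each of n independent keys yields its own detector z-score, Stouffer-style combination scales the joint score by √n.

```python
import math

def combined_z(per_key_z: list) -> float:
    # Stouffer aggregation of n independent per-key detector z-scores:
    # Z = (sum of z_i) / sqrt(n). If each layer alone yields z ≈ mu,
    # the ensemble yields Z ≈ sqrt(n) * mu — the sqrt(n) SNR gain.
    n = len(per_key_z)
    return sum(per_key_z) / math.sqrt(n)

# 16 layers, each barely detectable alone (z = 1.0): combined z = 4.0.
print(combined_z([1.0] * 16))  # → 4.0
```

This is also why fewer tokens suffice for detection: a fixed target z-score is reached with roughly 1/n as many tokens when n layers contribute independent evidence.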

🛡️ Threat Analysis

Output Integrity Attack

Embeds statistical watermark signals into LLM-generated text to verify provenance and detect AI-generated content, i.e., output integrity and content authenticity. The watermark lives in the generated text, not in the model weights, which places this under ML09 (Output Integrity Attack) rather than ML05 (Model Theft).
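The verification side of this threat model is a statistical test over the text alone. The sketch below is a hypothetical detector matching the Gumbel-max, key-cycling assumptions above: it re-derives each position's keyed noise and checks whether observed tokens systematically land on high-noise entries (unwatermarked text averages the Gumbel mean, Euler's gamma ≈ 0.577).

```python
import hashlib
import math
import random

def keyed_gumbel(key: int, context: tuple, vocab_size: int) -> list:
    # Re-derive the same pseudorandom Gumbel noise the generator used.
    seed = hashlib.sha256(f"{key}:{context}".encode()).digest()
    rng = random.Random(seed)
    return [-math.log(-math.log(rng.random())) for _ in range(vocab_size)]

def detect_score(tokens: list, keys: list, vocab_size: int,
                 ctx_window: int = 4) -> float:
    # Mean Gumbel score of observed tokens under the cycled keys.
    # Watermarked text is biased toward high-noise tokens, so its mean
    # score exceeds the unwatermarked baseline of ~0.577 (Euler gamma).
    scores = []
    for t, tok in enumerate(tokens):
        key = keys[t % len(keys)]
        context = tuple(tokens[max(0, t - ctx_window):t])
        g = keyed_gumbel(key, context, vocab_size)
        scores.append(g[tok])
    return sum(scores) / len(scores)
```

An attacker mounting an output integrity attack (paraphrasing, smoothing) tries to push this score back toward the baseline without destroying the text's utility; the paper's claim is that ensembling raises how much perturbation that takes.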


Details

Domains: nlp
Model Types: llm, transformer
Threat Tags: inference_time
Applications: llm-generated text provenance verification; ai content attribution