Defense · 2025

Watermarks for Language Models via Probabilistic Automata

Yangkun Wang, Jingbo Shang

0 citations · 24 references · arXiv


Published on arXiv · 2512.10185

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Constructs watermarks via probabilistic automata in two instantiations: a practical scheme with exponential generation diversity and computational efficiency, and a theoretical one with formal cryptographic undetectability. Experiments on LLaMA-3B and Mistral-7B show robustness to edit-distance attacks and lower detection overhead than prior distortion-free schemes.

Probabilistic Automata Watermarking

Novel technique introduced


A recent watermarking scheme for language models achieves distortion-free embedding and robustness to edit-distance attacks. However, it suffers from limited generation diversity and high detection overhead. In parallel, recent research has focused on undetectability, a property ensuring that watermarks remain difficult for adversaries to detect and spoof. In this work, we introduce a new class of watermarking schemes constructed through probabilistic automata. We present two instantiations: (i) a practical scheme with exponential generation diversity and computational efficiency, and (ii) a theoretical construction with formal undetectability guarantees under cryptographic assumptions. Extensive experiments on LLaMA-3B and Mistral-7B validate the superior performance of our scheme in terms of robustness and efficiency.


Key Contributions

  • New class of LLM text watermarking schemes constructed via probabilistic automata, achieving exponential generation diversity and computational efficiency
  • Theoretical construction with formal undetectability guarantees under cryptographic assumptions, preventing adversaries from detecting or spoofing the watermark
  • Empirical validation on LLaMA-3B and Mistral-7B demonstrating superior robustness against edit-distance attacks and lower detection overhead than prior work
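The paper does not publish its construction in this summary, but the general idea of automaton-keyed watermarking can be illustrated with a toy sketch: a secret-keyed deterministic automaton walks over the emitted tokens, and its current state (rather than just the previous token) decides which half of the vocabulary counts as "green"; generation is biased toward green tokens and detection replays the automaton and counts green hits. Everything below (`prf`, `AutomatonWatermarker`, the state/green derivations) is an illustrative assumption, not the authors' scheme.

```python
import hashlib
import random

def prf(key: bytes, data: bytes) -> int:
    """Keyed pseudorandom function via SHA-256 (illustrative stand-in)."""
    return int.from_bytes(hashlib.sha256(key + data).digest()[:8], "big")

class AutomatonWatermarker:
    """Toy automaton-keyed watermarker (NOT the paper's construction).

    A keyed deterministic automaton over emitted tokens selects, per
    state, a pseudorandom 'green' half of the vocabulary. Generation
    prefers green tokens; detection replays the same automaton and
    measures the fraction of green tokens.
    """

    def __init__(self, key: bytes, vocab_size: int, num_states: int = 64):
        self.key = key
        self.vocab_size = vocab_size
        self.num_states = num_states

    def _next_state(self, state: int, token: int) -> int:
        # Deterministic keyed transition: detector can replay it exactly.
        return prf(self.key, f"t:{state}:{token}".encode()) % self.num_states

    def _is_green(self, state: int, token: int) -> bool:
        # Each state induces a pseudorandom half-vocabulary green set.
        return prf(self.key, f"g:{state}:{token}".encode()) % 2 == 0

    def generate(self, length: int, seed: int = 0) -> list:
        """Stand-in for LM sampling: prefer green tokens at each step.

        A real integration would bias the model's next-token logits;
        here we just sample candidates and keep the green ones.
        """
        rng = random.Random(seed)
        state, out = 0, []
        for _ in range(length):
            candidates = rng.sample(range(self.vocab_size), 20)
            greens = [t for t in candidates if self._is_green(state, t)]
            token = greens[0] if greens else candidates[0]
            out.append(token)
            state = self._next_state(state, token)
        return out

    def green_fraction(self, tokens: list) -> float:
        """Detector: replay the automaton and count green tokens."""
        state, hits = 0, 0
        for token in tokens:
            hits += self._is_green(state, token)
            state = self._next_state(state, token)
        return hits / max(len(tokens), 1)
```

Watermarked output scores a green fraction near 1, while unrelated text hovers near 1/2, giving a simple statistical test. Note that this toy detector desynchronizes after a single insertion or deletion; achieving robustness to edit-distance attacks, as the paper claims, is precisely the harder part of the real construction.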

🛡️ Threat Analysis

Output Integrity Attack

Embeds detectable watermarks in LLM-generated text outputs to verify provenance and detect AI-generated content — a direct output integrity defense. The paper also addresses adversarial undetectability (preventing adversaries from detecting and spoofing the watermark), which is a robustness property of the content protection scheme.


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
inference_time · black_box
Datasets
LLaMA-3B outputs · Mistral-7B outputs
Applications
text provenance · ai-generated text detection · llm content attribution