Defense · 2025

A Reinforcement Learning Framework for Robust and Secure LLM Watermarking

Li An 1, Yujian Liu 1, Yepeng Liu 1, Yuheng Bu 1, Yang Zhang 2, Shiyu Chang 1

1 citation · 33 references · arXiv

Published on arXiv: 2510.21053

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Achieves state-of-the-art trade-off across all watermarking criteria with notable improvements in spoofing attack resistance without degrading detectability or text quality.

RL-Watermark

Novel technique introduced


Watermarking has emerged as a promising solution for tracing and authenticating text generated by large language models (LLMs). A common approach to LLM watermarking is to construct a green/red token list and assign higher or lower generation probabilities to the corresponding tokens, respectively. However, most existing watermarking algorithms rely on heuristic green/red token list designs, as directly optimizing the list design with techniques such as reinforcement learning (RL) comes with several challenges. First, desirable watermarking involves multiple criteria, i.e., detectability, text quality, robustness against removal attacks, and security against spoofing attacks. Directly optimizing for these criteria introduces many partially conflicting reward terms, leading to an unstable convergence process. Second, the vast action space of green/red token list choices is susceptible to reward hacking. In this paper, we propose an end-to-end RL framework for robust and secure LLM watermarking. Our approach adopts an anchoring mechanism for reward terms to ensure stable training and introduces additional regularization terms to prevent reward hacking. Experiments on standard benchmarks with two backbone LLMs show that our method achieves a state-of-the-art trade-off across all criteria, with notable improvements in resistance to spoofing attacks without degrading other criteria. Our code is available at https://github.com/UCSB-NLP-Chang/RL-watermark.
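The green/red-list mechanism the abstract describes can be sketched in a few lines. This is a generic KGW-style illustration, not this paper's learned list design: the hash-based seeding, the function names, and the `gamma`/`delta` parameters are illustrative assumptions.

```python
import random

def green_list(prev_token_id: int, vocab_size: int,
               gamma: float = 0.5, key: int = 42) -> set:
    """Pseudorandomly partition the vocabulary into a 'green' subset of
    size gamma * |V|, seeded by the previous token and a secret key
    (heuristic KGW-style scheme; the paper instead learns this design via RL)."""
    rng = random.Random(key ^ prev_token_id)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])

def bias_logits(logits: list, prev_token_id: int, delta: float = 2.0) -> list:
    """Add delta to each green token's logit before sampling, raising the
    generation probability of green tokens and lowering that of red ones."""
    green = green_list(prev_token_id, len(logits))
    return [l + delta if i in green else l for i, l in enumerate(logits)]
```

Because the partition is keyed and context-dependent, anyone holding the key can recompute it at detection time, while an attacker without the key sees only slightly biased sampling.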


Key Contributions

  • End-to-end RL framework that jointly optimizes LLM text watermarking across detectability, text quality, robustness against removal, and security against spoofing attacks
  • Anchoring mechanism for reward terms to stabilize multi-objective RL training with partially conflicting criteria
  • Regularization terms that prevent reward hacking in the large action space of green/red token list choices
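One way the anchoring idea above might look in code — this is an illustrative reading of the bullet, not the paper's actual formulation: each criterion's reward is measured relative to a fixed per-criterion anchor, so partially conflicting terms stay on comparable scales instead of one raw reward dominating the combined objective.

```python
def anchored_reward(rewards: dict, anchors: dict, weights: dict) -> float:
    """Illustrative sketch (assumed, not the paper's exact mechanism):
    combine multi-objective reward terms after centering each one on a
    fixed anchor value, e.g. a reference policy's score on that criterion."""
    return sum(weights[k] * (rewards[k] - anchors[k]) for k in rewards)
```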

🛡️ Threat Analysis

Output Integrity Attack

Watermarks are embedded in LLM text outputs (green/red token lists) to trace content provenance — this is output integrity and content watermarking, not model-weight watermarking. The paper explicitly addresses both removal attacks (robustness) and spoofing attacks (security), which are adversarial threats to content authenticity.
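On the defender's side, detection for green/red-list schemes typically reduces to counting green tokens and running a one-proportion z-test — standard practice for this family of watermarks, not a detail specific to this paper:

```python
import math

def detection_z(num_green: int, num_tokens: int, gamma: float = 0.5) -> float:
    """One-proportion z-test: under H0 (unwatermarked text) each token is
    green with probability gamma; watermarked text inflates the green count,
    yielding a large z-score. Removal attacks try to push z back down;
    spoofing attacks try to push it up on attacker-written text."""
    expected = gamma * num_tokens
    std = math.sqrt(num_tokens * gamma * (1 - gamma))
    return (num_green - expected) / std
```

For example, 150 green tokens out of 200 at `gamma = 0.5` gives z ≈ 7.1, far beyond a typical detection threshold of 4.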


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
inference_time · black_box
Datasets
C4
Applications
llm text provenance · ai-generated text authentication · content tracing