Defense · 2025

A Reinforcement Learning Framework for Robust and Secure LLM Watermarking

Li An 1, Yujian Liu 1, Yepeng Liu 1, Yuheng Bu 1, Yang Zhang 2, Shiyu Chang 1

1 citation · 33 references · arXiv

Published on arXiv: 2510.21053

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Achieves state-of-the-art trade-off across all watermarking criteria with notable improvements in spoofing attack resistance without degrading detectability or text quality.

RL-Watermark

Novel technique introduced


Watermarking has emerged as a promising solution for tracing and authenticating text generated by large language models (LLMs). A common approach to LLM watermarking is to construct a green/red token list and assign higher or lower generation probabilities to the corresponding tokens, respectively. However, most existing watermarking algorithms rely on heuristic green/red token list designs, as directly optimizing the list design with techniques such as reinforcement learning (RL) comes with several challenges. First, desirable watermarking involves multiple criteria, i.e., detectability, text quality, robustness against removal attacks, and security against spoofing attacks. Directly optimizing for these criteria introduces many partially conflicting reward terms, leading to an unstable convergence process. Second, the vast action space of green/red token list choices is susceptible to reward hacking. In this paper, we propose an end-to-end RL framework for robust and secure LLM watermarking. Our approach adopts an anchoring mechanism for reward terms to ensure stable training and introduces additional regularization terms to prevent reward hacking. Experiments on standard benchmarks with two backbone LLMs show that our method achieves a state-of-the-art trade-off across all criteria, with notable improvements in resistance to spoofing attacks without degrading other criteria. Our code is available at https://github.com/UCSB-NLP-Chang/RL-watermark.
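The green/red-list mechanism the abstract describes can be sketched in a few lines. This is a generic KGW-style illustration, not this paper's learned list design: the hash-based seeding, the function names, and the `gamma`/`delta` parameters are illustrative assumptions.

```python
import random

def green_list(prev_token_id: int, vocab_size: int,
               gamma: float = 0.5, key: int = 42) -> set:
    """Pseudorandomly partition the vocabulary into a 'green' subset of
    size gamma * |V|, seeded by the previous token and a secret key
    (heuristic KGW-style scheme; the paper instead learns this design via RL)."""
    rng = random.Random(key ^ prev_token_id)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])

def bias_logits(logits: list, prev_token_id: int, delta: float = 2.0) -> list:
    """Add delta to each green token's logit before sampling, raising the
    generation probability of green tokens and lowering that of red ones."""
    green = green_list(prev_token_id, len(logits))
    return [l + delta if i in green else l for i, l in enumerate(logits)]
```

Because the partition is keyed and context-dependent, anyone holding the key can recompute it at detection time, while an attacker without the key sees only slightly biased sampling.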


Key Contributions

  • End-to-end RL framework that jointly optimizes LLM text watermarking across detectability, text quality, robustness against removal, and security against spoofing attacks
  • Anchoring mechanism for reward terms to stabilize multi-objective RL training with partially conflicting criteria
  • Regularization terms that prevent reward hacking in the large action space of green/red token list choices
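One way the anchoring idea above might look in code — this is an illustrative reading of the bullet, not the paper's actual formulation: each criterion's reward is measured relative to a fixed per-criterion anchor, so partially conflicting terms stay on comparable scales instead of one raw reward dominating the combined objective.

```python
def anchored_reward(rewards: dict, anchors: dict, weights: dict) -> float:
    """Illustrative sketch (assumed, not the paper's exact mechanism):
    combine multi-objective reward terms after centering each one on a
    fixed anchor value, e.g. a reference policy's score on that criterion."""
    return sum(weights[k] * (rewards[k] - anchors[k]) for k in rewards)
```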

🛡️ Threat Analysis

Output Integrity Attack

Watermarks are embedded in LLM text outputs (green/red token lists) to trace content provenance — this is output integrity and content watermarking, not model-weight watermarking. The paper explicitly addresses both removal attacks (robustness) and spoofing attacks (security), which are adversarial threats to content authenticity.
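On the defender's side, detection for green/red-list schemes typically reduces to counting green tokens and running a one-proportion z-test — standard practice for this family of watermarks, not a detail specific to this paper:

```python
import math

def detection_z(num_green: int, num_tokens: int, gamma: float = 0.5) -> float:
    """One-proportion z-test: under H0 (unwatermarked text) each token is
    green with probability gamma; watermarked text inflates the green count,
    yielding a large z-score. Removal attacks try to push z back down;
    spoofing attacks try to push it up on attacker-written text."""
    expected = gamma * num_tokens
    std = math.sqrt(num_tokens * gamma * (1 - gamma))
    return (num_green - expected) / std
```

For example, 150 green tokens out of 200 at `gamma = 0.5` gives z ≈ 7.1, far beyond a typical detection threshold of 4.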


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
inference_time · black_box
Datasets
C4
Applications
llm text provenance · ai-generated text authentication · content tracing