RLSpoofer: A Lightweight Evaluator for LLM Watermark Spoofing Resilience
Hanbo Huang, Xuan Gong, Yiran Zhang, Hao Zheng, Shiyu Liang
Published on arXiv
2604.11546
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Achieves a 62.0% watermark spoof success rate with minimal semantic shift using only 100 training pairs, compared with 6% for baselines trained on up to 10,000 samples
RLSpoofer
Novel technique introduced
Large language model (LLM) watermarking has emerged as a promising approach for detecting and attributing AI-generated text, yet its robustness to black-box spoofing remains insufficiently evaluated. Existing evaluation methods often demand extensive datasets and white-box access to algorithmic internals, limiting their practical applicability. In this paper, we study watermark resilience against spoofing fundamentally from a distributional perspective. We first establish a 'local capacity bottleneck', which theoretically characterizes the probability mass that can be reallocated under KL-bounded local updates while preserving semantic fidelity. Building on this, we propose RLSpoofer, a reinforcement learning-based black-box spoofing attack that requires only 100 human-watermarked paraphrase training pairs and zero access to the watermarking internals or detectors. Despite weak supervision, it empowers a 4B model to achieve a 62.0% spoof success rate with minimal semantic shift on PF-marked texts, dwarfing the 6% of baseline models trained on up to 10,000 samples. Our findings expose the fragile spoofing resistance of current LLM watermarking paradigms, providing a lightweight evaluation framework and stressing the urgent need for more robust schemes.
Key Contributions
- Establishes a 'local capacity bottleneck' that theoretically characterizes how much probability mass can be reallocated under KL-bounded local updates while preserving semantic fidelity
- Proposes RLSpoofer, a sample-efficient RL-based black-box watermark spoofing attack that requires only 100 training pairs and no access to watermarking internals or detectors
- Demonstrates a 62.0% spoof success rate against PF-Watermark using a 4B model, far above the 6% of baselines trained on up to 10,000 samples
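To give a feel for what "probability mass that can be reallocated under a KL bound" means, here is a minimal Python sketch. It does not reproduce the paper's theorem. It relies on a standard fact instead: if a token set has mass a under a base distribution p, then among all distributions q that place mass b on that set, the smallest possible KL(q || p) is the binary KL divergence between b and a (attained by rescaling p inside and outside the set). The largest reachable b under a budget eps is therefore the root of binary_kl(b, a) = eps. The names `binary_kl`, `max_reallocated_mass`, and the parameters `a` and `eps` are hypothetical and chosen only for illustration.

```python
import math


def binary_kl(b: float, a: float) -> float:
    """KL(Bernoulli(b) || Bernoulli(a)) in nats, for 0 < a < 1 and 0 <= b <= 1."""
    def term(x: float, y: float) -> float:
        # Convention: 0 * log(0 / y) = 0.
        return 0.0 if x == 0.0 else x * math.log(x / y)
    return term(b, a) + term(1.0 - b, 1.0 - a)


def max_reallocated_mass(a: float, eps: float, iters: int = 60) -> float:
    """Largest mass b >= a that a distribution q can place on a token set
    with base mass a while keeping KL(q || p) <= eps (illustrative only)."""
    if not 0.0 < a < 1.0:
        raise ValueError("base mass a must lie strictly between 0 and 1")
    if eps < 0.0:
        raise ValueError("KL budget eps must be non-negative")
    lo, hi = a, 1.0
    # binary_kl(b, a) is increasing in b on [a, 1], so bisection applies.
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if binary_kl(mid, a) <= eps:
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    for eps in (0.01, 0.1, 0.5):
        print(eps, round(max_reallocated_mass(0.5, eps), 4))
```

With a base mass of 0.5 and a budget of 0.1 nats, the reachable mass is roughly 0.72, so even a modest KL budget allows a noticeable shift toward a target token set. How this relates quantitatively to the paper's bottleneck result is not something this sketch establishes.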
🛡️ Threat Analysis
Attacks LLM content watermarking schemes by forging watermarks so that text not produced by the watermarked model, such as human-written text, is detected as watermarked AI output, directly targeting output integrity and content provenance verification.