
RLSpoofer: A Lightweight Evaluator for LLM Watermark Spoofing Resilience

Hanbo Huang, Xuan Gong, Yiran Zhang, Hao Zheng, Shiyu Liang



Published on arXiv (arXiv:2604.11546)

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Achieves a 62.0% watermark spoof success rate with minimal semantic shift using only 100 training pairs, versus a 6% baseline trained on 10,000 samples

RLSpoofer

Novel technique introduced


Large language model (LLM) watermarking has emerged as a promising approach for detecting and attributing AI-generated text, yet its robustness to black-box spoofing remains insufficiently evaluated. Existing evaluation methods often demand extensive datasets and white-box access to algorithmic internals, limiting their practical applicability. In this paper, we study watermark resilience against spoofing fundamentally from a distributional perspective. We first establish a "local capacity bottleneck", which theoretically characterizes the probability mass that can be reallocated under KL-bounded local updates while preserving semantic fidelity. Building on this, we propose RLSpoofer, a reinforcement learning-based black-box spoofing attack that requires only 100 human-watermarked paraphrase training pairs and zero access to the watermarking internals or detectors. Despite weak supervision, it empowers a 4B model to achieve a 62.0% spoof success rate with minimal semantic shift on PF-marked texts, far exceeding the 6% achieved by baselines trained on up to 10,000 samples. Our findings expose the fragile spoofing resistance of current LLM watermarking paradigms, providing a lightweight evaluation framework and stressing the urgent need for more robust schemes.
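The "local capacity bottleneck" ties a KL budget on a local distribution update to the probability mass that update can move. A minimal numeric sketch of that general KL-to-mass relationship (via Pinsker's inequality; this illustrates the idea only and is not the paper's formal bound):

```python
import math

def kl(p, q):
    """KL divergence D(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def tv(p, q):
    """Total variation distance: half the L1 gap, i.e. the mass reallocated."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# A toy next-token distribution and a small, KL-bounded local update of it.
p = [0.5, 0.3, 0.2]
q = [0.45, 0.33, 0.22]

eps = kl(p, q)              # KL budget consumed by the update
moved = tv(p, q)            # probability mass actually reallocated
bound = math.sqrt(eps / 2)  # Pinsker's inequality: tv <= sqrt(KL / 2)
assert moved <= bound
print(f"KL = {eps:.4f}, mass moved = {moved:.4f}, Pinsker cap = {bound:.4f}")
```

The smaller the KL budget a semantically faithful rewrite is allowed, the less mass a spoofing update can shift toward watermark-carrying tokens, which is the tension the paper's bottleneck result formalizes.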


Key Contributions

  • Establishes 'local capacity bottleneck' theory characterizing probability mass reallocation under KL-bounded updates
  • Proposes RLSpoofer, a sample-efficient RL-based watermark spoofing attack requiring only 100 training pairs and zero watermarking internals access
  • Demonstrates 62% spoof success rate against PF-Watermark using 4B model, vastly outperforming baselines trained on 10,000 samples
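Since RLSpoofer trains only on (human text, human-watermarked paraphrase) pairs with no detector feedback, its reward must be computed from those pairs alone. A hypothetical sketch of such a reward; every function name, similarity proxy, and weight here is an illustrative assumption, not the paper's implementation:

```python
# Hypothetical reward for RL-based watermark spoofing trained solely from
# (human_text, watermarked_paraphrase) pairs -- no detector access.
# token_overlap is a crude Jaccard proxy standing in for a learned scorer.

def token_overlap(a, b):
    """Jaccard similarity over whitespace tokens (illustrative proxy only)."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)

def spoof_reward(generated, watermarked_ref, source, w=0.5):
    # Pull the policy toward the watermark-carrying style of the reference...
    style_term = token_overlap(generated, watermarked_ref)
    # ...while penalizing semantic drift away from the original human text.
    fidelity_term = token_overlap(generated, source)
    return w * style_term + (1 - w) * fidelity_term
```

A policy-gradient method would maximize this reward over the 100 training pairs; the point of the sketch is only that both terms are computable black-box, without touching the watermarking scheme or its detector.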

🛡️ Threat Analysis

Output Integrity Attack

Attacks LLM content watermarking schemes by forging watermarks to make human text appear AI-generated (or vice versa), directly targeting output integrity and content provenance verification.


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, inference_time
Applications
ai-generated text detection, content provenance, watermark verification