
RLSpoofer: A Lightweight Evaluator for LLM Watermark Spoofing Resilience

Hanbo Huang, Xuan Gong, Yiran Zhang, Hao Zheng, Shiyu Liang



Published on arXiv (arXiv:2604.11546)

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Achieves a 62.0% watermark spoof success rate with minimal semantic shift using only 100 training pairs, versus a 6% baseline trained on 10,000 samples

RLSpoofer

Novel technique introduced


Large language model (LLM) watermarking has emerged as a promising approach for detecting and attributing AI-generated text, yet its robustness to black-box spoofing remains insufficiently evaluated. Existing evaluation methods often demand extensive datasets and white-box access to algorithmic internals, limiting their practical applicability. In this paper, we study watermark resilience against spoofing fundamentally from a distributional perspective. We first establish a "local capacity bottleneck", which theoretically characterizes the probability mass that can be reallocated under KL-bounded local updates while preserving semantic fidelity. Building on this, we propose RLSpoofer, a reinforcement learning-based black-box spoofing attack that requires only 100 human-watermarked paraphrase training pairs and zero access to the watermarking internals or detectors. Despite weak supervision, it empowers a 4B model to achieve a 62.0% spoof success rate with minimal semantic shift on PF-marked texts, far exceeding the 6% achieved by baselines trained on up to 10,000 samples. Our findings expose the fragile spoofing resistance of current LLM watermarking paradigms, providing a lightweight evaluation framework and stressing the urgent need for more robust schemes.
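The "local capacity bottleneck" ties a KL budget on a local distribution update to the probability mass that update can move. A minimal numeric sketch of that general KL-to-mass relationship (via Pinsker's inequality; this illustrates the idea only and is not the paper's formal bound):

```python
import math

def kl(p, q):
    """KL divergence D(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def tv(p, q):
    """Total variation distance: half the L1 gap, i.e. the mass reallocated."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# A toy next-token distribution and a small, KL-bounded local update of it.
p = [0.5, 0.3, 0.2]
q = [0.45, 0.33, 0.22]

eps = kl(p, q)              # KL budget consumed by the update
moved = tv(p, q)            # probability mass actually reallocated
bound = math.sqrt(eps / 2)  # Pinsker's inequality: tv <= sqrt(KL / 2)
assert moved <= bound
print(f"KL = {eps:.4f}, mass moved = {moved:.4f}, Pinsker cap = {bound:.4f}")
```

The smaller the KL budget a semantically faithful rewrite is allowed, the less mass a spoofing update can shift toward watermark-carrying tokens, which is the tension the paper's bottleneck result formalizes.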


Key Contributions

  • Establishes 'local capacity bottleneck' theory characterizing probability mass reallocation under KL-bounded updates
  • Proposes RLSpoofer, a sample-efficient RL-based watermark spoofing attack requiring only 100 training pairs and zero watermarking internals access
  • Demonstrates 62% spoof success rate against PF-Watermark using 4B model, vastly outperforming baselines trained on 10,000 samples
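Since RLSpoofer trains only on (human text, human-watermarked paraphrase) pairs with no detector feedback, its reward must be computed from those pairs alone. A hypothetical sketch of such a reward; every function name, similarity proxy, and weight here is an illustrative assumption, not the paper's implementation:

```python
# Hypothetical reward for RL-based watermark spoofing trained solely from
# (human_text, watermarked_paraphrase) pairs -- no detector access.
# token_overlap is a crude Jaccard proxy standing in for a learned scorer.

def token_overlap(a, b):
    """Jaccard similarity over whitespace tokens (illustrative proxy only)."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)

def spoof_reward(generated, watermarked_ref, source, w=0.5):
    # Pull the policy toward the watermark-carrying style of the reference...
    style_term = token_overlap(generated, watermarked_ref)
    # ...while penalizing semantic drift away from the original human text.
    fidelity_term = token_overlap(generated, source)
    return w * style_term + (1 - w) * fidelity_term
```

A policy-gradient method would maximize this reward over the 100 training pairs; the point of the sketch is only that both terms are computable black-box, without touching the watermarking scheme or its detector.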

🛡️ Threat Analysis

Output Integrity Attack

Attacks LLM content watermarking schemes by forging watermarks to make human text appear AI-generated (or vice versa), directly targeting output integrity and content provenance verification.


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, inference_time
Applications
ai-generated text detection, content provenance, watermark verification