
RLCracker: Exposing the Vulnerability of LLM Watermarks with Adaptive RL Attacks

Hanbo Huang 1, Yiran Zhang 1, Hao Zheng 1, Xuan Gong 1, Yihan Li 2, Lin Liu 2, Shiyu Liang 1

0 citations · 45 references · arXiv


Published on arXiv (2509.20924)

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

RLCracker, trained on only 100 short samples, achieves 98.5% watermark removal success and an average 0.92 P-SP semantic score on 1,500-token Unigram-marked texts, dramatically exceeding GPT-4o's 6.75% baseline and generalizing across ten watermarking schemes.

RLCracker

Novel technique introduced


Large Language Model (LLM) watermarking has shown promise in detecting AI-generated content and mitigating misuse, with prior work claiming robustness against paraphrasing and text editing. In this paper, we argue that existing evaluations are not sufficiently adversarial, obscuring critical vulnerabilities and overstating security. To address this, we introduce the adaptive robustness radius, a formal metric that quantifies watermark resilience against adaptive adversaries. We theoretically prove that optimizing the attack context and model parameters can substantially reduce this radius, making watermarks highly susceptible to paraphrase attacks. Leveraging this insight, we propose RLCracker, a reinforcement learning (RL)-based adaptive attack that erases watermarks while preserving semantic fidelity. RLCracker requires only limited watermarked examples and zero access to the detector. Despite this weak supervision, it empowers a 3B model to achieve 98.5% removal success and an average 0.92 P-SP score on 1,500-token Unigram-marked texts after training on only 100 short samples. This performance dramatically exceeds GPT-4o's 6.75% and generalizes across five model sizes and ten watermarking schemes. Our results confirm that adaptive attacks are broadly effective and pose a fundamental threat to current watermarking defenses.
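To make concrete what these attacks defeat: a Unigram-style green-list detector flags text whose fraction of "green" tokens is statistically above chance. A minimal sketch of the z-statistic such a detector computes (function name and tokenization are illustrative, not taken from the paper):

```python
import math

def green_fraction_z(tokens, green, gamma=0.5):
    """Z-statistic for a Unigram-style green-list watermark detector.

    Watermarked generation over-samples 'green' tokens; under the null
    (human/unwatermarked text) each token is green with probability
    gamma, so the green count is ~Binomial(n, gamma). A large positive
    z flags the text as watermarked; a successful paraphrase attack
    drives z back below the detection threshold.
    """
    n = len(tokens)
    hits = sum(1 for t in tokens if t in green)
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
```

On unwatermarked text the statistic hovers near zero; erasing the watermark means the attacker's paraphrase lands in that regime without access to `green` itself.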


Key Contributions

  • Introduces 'adaptive robustness radius', a formal metric quantifying watermark resilience against adaptive adversaries, with theoretical proof that optimizing attack context and model parameters dramatically collapses it
  • Proposes RLCracker, an RL-based adaptive paraphrase attack that removes LLM text watermarks with zero detector access and only 100 short training samples
  • Demonstrates 98.5% watermark removal success on 1,500-token texts with a 3B model, generalizing across five model sizes and ten watermarking schemes
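Because RLCracker assumes zero detector access, its RL reward must be built from proxy signals. The sketch below is a hypothetical shaping (names, weights, and the semantic floor are assumptions, not the paper's actual reward): a surrogate removal score is combined with a P-SP-style semantic-similarity score, gated so the policy cannot "erase" the watermark by destroying the text.

```python
def paraphrase_attack_reward(removal_score, semantic_score,
                             w_remove=1.0, w_sem=1.0, sem_floor=0.8):
    """Hypothetical shaped reward for an RL paraphrase attacker.

    removal_score: proxy for watermark erasure in [0, 1], e.g. from a
        surrogate detector fit on the attacker's few watermarked samples.
    semantic_score: paraphrase fidelity in [0, 1], e.g. an embedding
        similarity in the spirit of P-SP.
    sem_floor: hard floor on fidelity; below it the reward is zeroed so
        degenerate rewrites never pay off.
    """
    if semantic_score < sem_floor:
        return 0.0
    return w_remove * removal_score + w_sem * semantic_score
```

The floor is the key design choice: without it, a policy trained only to minimize the surrogate detector score converges to garbled outputs, which trivially remove the watermark but fail the semantic-fidelity objective the paper reports (0.92 P-SP).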

🛡️ Threat Analysis

Output Integrity Attack

RLCracker is a watermark removal attack targeting content watermarks embedded in LLM text outputs — the core ML09 threat of attacking output integrity and content provenance. The paper directly attacks watermarking schemes designed to detect AI-generated content, defeating 10 different schemes.


Details

Domains
nlp
Model Types
llm, rl
Threat Tags
black_box, inference_time
Datasets
Unigram-watermarked text corpus (1,500-token texts, 100-sample training set)
Applications
llm text watermarking, ai-generated content detection