StealthRL: Reinforcement Learning Paraphrase Attacks for Multi-Detector Evasion of AI-Text Detectors
Suraj Ranganath, Atharv Ramesh
Published on arXiv
arXiv:2602.08934
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Reduces mean AUROC from 0.74 to 0.27 and achieves a 99.9% attack success rate across the RoBERTa, FastDetectGPT, and Binoculars detector families, including successful transfer to the Binoculars family, which was held out during training.
StealthRL
Novel technique introduced
AI-text detectors face a critical robustness challenge: adversarial paraphrasing attacks that preserve semantics while evading detection. We introduce StealthRL, a reinforcement learning framework that stress-tests detector robustness under realistic adversarial conditions. StealthRL trains a paraphrase policy against a multi-detector ensemble using Group Relative Policy Optimization (GRPO) with LoRA adapters on Qwen3-4B, optimizing a composite reward that balances detector evasion with semantic preservation. We evaluate six attack settings (M0-M5) against three detector families (RoBERTa, FastDetectGPT, and Binoculars) at the security-relevant 1% false positive rate operating point. StealthRL achieves near-zero detection (0.001 mean TPR@1%FPR), reduces mean AUROC from 0.74 to 0.27, and attains a 99.9% attack success rate. Critically, attacks transfer to a held-out detector family not seen during training, revealing shared architectural vulnerabilities rather than detector-specific brittleness. We additionally conduct LLM-based quality evaluation via Likert scoring, analyze detector score distributions to explain why evasion succeeds, and provide per-detector AUROC with bootstrap confidence intervals. Our results expose significant robustness gaps in current AI-text detection and establish StealthRL as a principled adversarial evaluation protocol. Code and evaluation pipeline are publicly available at https://github.com/suraj-ranganath/StealthRL.
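The abstract describes a composite reward that balances detector evasion against semantic preservation. A minimal sketch of such a reward is below; the function name, the simple linear weighting `alpha`, and the averaging over detector scores are illustrative assumptions, not the paper's actual reward definition.

```python
def composite_reward(detector_scores, semantic_sim, alpha=0.5):
    """Illustrative composite reward for an evasion paraphrase policy.

    detector_scores: per-detector probabilities that the paraphrase is
                     AI-written (lower is better for the attacker).
    semantic_sim:    similarity between source text and paraphrase in
                     [0, 1] (higher means meaning is preserved).
    alpha:           hypothetical trade-off weight between evasion and
                     fidelity; not taken from the paper.
    """
    # Evasion term: 1 minus the mean detector score over the ensemble,
    # so fooling every detector yields 1.0.
    evasion = 1.0 - sum(detector_scores) / len(detector_scores)
    # Linear blend of evasion and semantic preservation.
    return alpha * evasion + (1.0 - alpha) * semantic_sim
```

In GRPO, a scalar reward like this would be computed for each paraphrase in a sampled group, and policy updates would push toward completions whose reward exceeds the group mean.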
Key Contributions
- StealthRL: GRPO + LoRA fine-tuning on Qwen3-4B that trains a paraphrase policy against a multi-detector ensemble, achieving 0.001 mean TPR@1%FPR across three detector families
- Demonstrates cross-architecture transfer to a held-out detector family, exposing shared structural vulnerabilities rather than detector-specific brittleness
- Establishes an adversarial evaluation protocol at the security-relevant 1% FPR operating point with LLM-based quality scoring and bootstrap confidence intervals
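The evaluation protocol above reports TPR at a fixed 1% false positive rate with bootstrap confidence intervals. A minimal stdlib-only sketch of how these two quantities can be computed is shown here; the function names and resampling scheme are illustrative, not the released evaluation code.

```python
import random

def tpr_at_fpr(human_scores, ai_scores, target_fpr=0.01):
    """TPR at a fixed FPR: pick the threshold that flags at most
    `target_fpr` of human-written texts, then measure recall on AI
    texts. Scores are detector outputs where higher = more likely AI."""
    hs = sorted(human_scores, reverse=True)
    k = max(int(target_fpr * len(hs)) - 1, 0)
    threshold = hs[k]
    return sum(s > threshold for s in ai_scores) / len(ai_scores)

def bootstrap_ci(human_scores, ai_scores, n_boot=1000, seed=0):
    """95% percentile bootstrap CI for TPR@1%FPR, resampling both
    score sets with replacement."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        h = rng.choices(human_scores, k=len(human_scores))
        a = rng.choices(ai_scores, k=len(ai_scores))
        stats.append(tpr_at_fpr(h, a))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot)]
```

Evaluating at a low fixed FPR matters because deployment-realistic detectors must rarely flag human text; an attack that drives TPR@1%FPR to 0.001 leaves the detector essentially blind at that operating point.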
🛡️ Threat Analysis
AI-text detectors are content integrity/provenance systems, and StealthRL is an evasion attack that defeats them. Like watermark-removal attacks, defeating content authentication falls under ML09 (Output Integrity Attack), not ML01.