PADBen: A Comprehensive Benchmark for Evaluating AI Text Detectors Against Paraphrase Attacks

While AI-generated text (AIGT) detectors achieve over 90% accuracy on direct LLM outputs, they fail catastrophically against iteratively paraphrased content. We investigate why iteratively paraphrased text -- itself AI-generated -- evades detection systems designed for AIGT identification. Through intrinsic mechanism analysis, we reveal that iterative paraphrasing creates an intermediate laundering region characterized by semantic displacement with preserved generation patterns. This region gives rise to two attack categories: paraphrasing human-authored text (authorship obfuscation) and paraphrasing LLM-generated text (plagiarism evasion). To address these vulnerabilities, we introduce PADBen, the first benchmark that systematically evaluates detector robustness against both paraphrase attack scenarios. PADBen comprises a five-type text taxonomy capturing the full trajectory from original content to deeply laundered text, and five progressive detection tasks across sentence-pair and single-sentence challenges. We evaluate 11 state-of-the-art detectors and find a critical asymmetry: detectors handle plagiarism evasion successfully but fail on authorship obfuscation. Our findings demonstrate that current detection approaches cannot effectively handle the intermediate laundering region, necessitating fundamental advances in detection architectures beyond existing semantic and stylistic discrimination methods. For the code implementation, see https://github.com/JonathanZha47/PadBen-Paraphrase-Attack-Benchmark.
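As a rough illustration of what "iteratively paraphrased" means here, the sketch below builds the sequence of texts from an original to a depth-n paraphrase. It is not taken from the PADBen repository: `paraphrase_trajectory` and `toy_paraphrase` are hypothetical names, and the word-swap function only stands in for the LLM paraphraser a real attack would call.

```python
from typing import Callable, List


def paraphrase_trajectory(text: str,
                          paraphrase: Callable[[str], str],
                          depth: int) -> List[str]:
    """Return [original, 1st paraphrase, ..., depth-th paraphrase]."""
    if depth < 0:
        raise ValueError("depth must be non-negative")
    trajectory = [text]
    for _ in range(depth):
        # Each round paraphrases the previous round's output, not the original.
        trajectory.append(paraphrase(trajectory[-1]))
    return trajectory


def toy_paraphrase(text: str) -> str:
    """Hypothetical stand-in for an LLM paraphraser: swaps a few words."""
    swaps = {"quick": "fast", "fast": "rapid", "rapid": "swift"}
    return " ".join(swaps.get(word, word) for word in text.split())


if __name__ == "__main__":
    for i, version in enumerate(
            paraphrase_trajectory("the quick fox", toy_paraphrase, 3)):
        print(i, version)
```

Keeping every intermediate version, rather than only the final text, makes it possible to score a detector at each paraphrasing depth.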

Key Contributions

Intrinsic mechanism analysis revealing that iterative paraphrasing creates an 'intermediate laundering region' with semantic displacement but preserved generation patterns, explaining why AIGT detectors fail
PADBen: a five-type text taxonomy and five progressive detection tasks systematically covering authorship obfuscation and plagiarism evasion attack scenarios
Evaluation of 11 state-of-the-art detectors exposing a critical asymmetry — detectors handle plagiarism evasion but fail completely on authorship obfuscation

🛡️ Threat Analysis

Output Integrity Attack

PADBen evaluates the robustness of AI-generated text detectors — a core output integrity concern. The paraphrase attacks studied are evasion attacks specifically designed to defeat AIGT detection systems. The paper reveals a critical 'intermediate laundering region' that creates systematic blind spots in detection, and the benchmark directly measures detector failure under both authorship obfuscation and plagiarism evasion attack scenarios.
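One way to surface the kind of per-scenario asymmetry described above is to compute a detector's detection rate separately for each attack scenario. The following is a minimal sketch under stated assumptions, not PADBen's evaluation code: the `Sample` record, the scenario labels, and the keyword-based `toy_detector` are all hypothetical, and the example data is made up purely to exercise the function.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, NamedTuple


class Sample(NamedTuple):
    text: str
    scenario: str    # hypothetical label, e.g. "authorship_obfuscation"
    is_attack: bool  # True if the text is a paraphrase attack


def detection_rate_by_scenario(samples: Iterable[Sample],
                               detector: Callable[[str], bool]) -> Dict[str, float]:
    """Fraction of attack samples the detector flags, per scenario."""
    flagged: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for sample in samples:
        if not sample.is_attack:
            continue  # only attack samples count toward the detection rate
        total[sample.scenario] += 1
        if detector(sample.text):
            flagged[sample.scenario] += 1
    return {name: flagged[name] / total[name] for name in total}


def toy_detector(text: str) -> bool:
    """Hypothetical detector: flags text containing one marker word."""
    return "delve" in text.lower()


if __name__ == "__main__":
    demo = [
        Sample("we delve into results", "plagiarism_evasion", True),
        Sample("honestly it was fine", "authorship_obfuscation", True),
    ]
    print(detection_rate_by_scenario(demo, toy_detector))
```

Reporting the two rates side by side, instead of a single pooled accuracy, keeps a weak scenario from being masked by a strong one.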

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

black_box, inference_time

Datasets

PADBen (introduced in paper)

Applications

2025 0 cit.
