Self-Disguise Attack: Induce the LLM to disguise itself for AIGT detection evasion
Yinghan Zhou, Juan Wen, Wanli Peng, Zhengxian Wu, Ziwei Zhang, Yiming Xue
Published on arXiv (2508.15848)
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
SDA reduces the average detection accuracy of various AIGT classifiers on outputs from three different LLMs while maintaining text quality, without requiring fine-tuning.
Self-Disguise Attack (SDA)
Novel technique introduced
AI-generated text (AIGT) detection evasion aims to reduce the detection probability of AIGT, helping to identify weaknesses in detectors and enhance their effectiveness and reliability in practical applications. Although existing evasion methods perform well, they suffer from high computational costs and text-quality degradation. To address these challenges, we propose the Self-Disguise Attack (SDA), a novel approach that enables large language models (LLMs) to actively disguise their output, reducing the likelihood of detection by classifiers. SDA comprises two main components: an adversarial feature extractor and a retrieval-based context examples optimizer. The former generates disguise features that teach the LLM how to produce more human-like text. The latter retrieves the most relevant examples from an external knowledge base as in-context examples, further enhancing the LLM's self-disguise ability and mitigating the impact of the disguise process on the diversity of the generated text. SDA directly employs prompts containing the disguise features and optimized context examples to guide the LLM in generating detection-resistant text, thereby reducing resource consumption. Experimental results demonstrate that SDA effectively reduces the average detection accuracy of various AIGT detectors on texts generated by three different LLMs while maintaining the quality of the AIGT.
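Because SDA works purely at inference time, the core mechanism reduces to prompt assembly. The sketch below illustrates that idea; the function name, prompt wording, and inputs are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of SDA-style prompt assembly (names and wording are
# assumptions, not the paper's API). No fine-tuning is needed: the LLM is
# steered by disguise features plus retrieved in-context examples.

def build_disguise_prompt(task, disguise_features, context_examples):
    """Assemble one prompt that asks the LLM to disguise its output."""
    feature_block = "\n".join(f"- {f}" for f in disguise_features)
    example_block = "\n\n".join(
        f"Example {i + 1}:\n{ex}" for i, ex in enumerate(context_examples)
    )
    return (
        "Write text that reads as human-authored. Follow these style features:\n"
        f"{feature_block}\n\n"
        "Imitate the style of these reference texts:\n"
        f"{example_block}\n\n"
        f"Task: {task}"
    )

prompt = build_disguise_prompt(
    task="Summarize the causes of urban heat islands.",
    disguise_features=["vary sentence length", "use informal connectives"],
    context_examples=[
        "Cities trap heat, plain and simple...",
        "Asphalt soaks up the sun all day...",
    ],
)
```

The assembled prompt would then be sent to the target LLM as-is, which is what keeps the attack's resource cost low relative to fine-tuning or iterative paraphrasing approaches.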
Key Contributions
- Adversarial feature extractor that uses an iterative adversarial process among a feature generator, text generator, and proxy detector to surface disguise features distinguishing AIGT from human-written text.
- Retrieval-based context examples optimizer (RAG-inspired) that selects top-k detection-resistant examples to preserve text diversity while guiding detection evasion.
- SDA reduces detection accuracy of multiple AIGT detectors across three LLMs without fine-tuning and with lower computational cost than prior methods.
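The retrieval-based optimizer can be pictured as a standard top-k similarity search over an external knowledge base of detection-resistant texts. The sketch below uses bag-of-words cosine similarity as a stand-in for whatever embedding the paper actually employs; all names and the toy knowledge base are assumptions.

```python
# Minimal sketch of a retrieval-based context examples optimizer. Assumption:
# bag-of-words cosine similarity substitutes for the paper's real retriever.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_top_k(query: str, knowledge_base: list[str], k: int = 2) -> list[str]:
    """Return the k knowledge-base entries most similar to the query."""
    q = Counter(query.lower().split())
    ranked = sorted(
        knowledge_base,
        key=lambda doc: cosine(q, Counter(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

# Toy knowledge base of candidate in-context examples (illustrative only).
kb = [
    "a casual blog post about city weather and heat",
    "a formal legal contract clause",
    "notes on urban heat and summer temperatures in the city",
]
examples = retrieve_top_k("urban heat in the city", kb, k=2)
```

Selecting examples that match the current generation task is what lets the attack preserve output diversity: the in-context demonstrations vary with the query rather than being fixed.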
🛡️ Threat Analysis
SDA is an evasion attack against AIGT detection systems: it undermines output integrity and provenance verification by making LLM-generated text undetectable. Attacking AIGT detectors is a direct ML09 (Output Integrity Attack) threat against content-authenticity systems.