Attack · 2026

MASH: Evading Black-Box AI-Generated Text Detectors via Style Humanization

Yongtong Gu¹, Songze Li¹, Xia Hu²

0 citations · 44 references · arXiv

Published on arXiv · 2601.08564

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

MASH achieves 92% average Attack Success Rate against 5 AIGT detectors across 6 datasets, outperforming the strongest baseline by 24%.

MASH (Multi-stage Alignment for Style Humanization)

Novel technique introduced


The increasing misuse of AI-generated texts (AIGT) has motivated the rapid development of AIGT detection methods. However, the reliability of these detectors remains fragile against adversarial evasions. Existing attack strategies often rely on white-box assumptions or demand prohibitively high computational and interaction costs, rendering them ineffective under practical black-box scenarios. In this paper, we propose Multi-stage Alignment for Style Humanization (MASH), a novel framework that evades black-box detectors based on style transfer. MASH sequentially employs style-injection supervised fine-tuning, direct preference optimization, and inference-time refinement to shape the distributions of AI-generated texts to resemble those of human-written texts. Experiments across 6 datasets and 5 detectors demonstrate the superior performance of MASH over 11 baseline evaders. Specifically, MASH achieves an average Attack Success Rate (ASR) of 92%, surpassing the strongest baselines by an average of 24%, while maintaining superior linguistic quality.
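The Attack Success Rate cited above is, under a common definition (an assumption here, since the excerpt does not state the paper's exact formula), the fraction of attacked AI-generated texts that a detector misclassifies as human-written. A minimal sketch:

```python
# Hedged sketch: one common definition of Attack Success Rate (ASR) for
# AIGT-detector evasion. The excerpt does not give the paper's formula,
# so this assumes: ASR = fraction of attacked AI-generated texts that
# the detector labels as human-written (i.e., the detector is fooled).

def attack_success_rate(detector_labels):
    """detector_labels: detector predictions on attacked AI texts,
    where the label 'human' means the evasion succeeded."""
    if not detector_labels:
        return 0.0
    fooled = sum(1 for label in detector_labels if label == "human")
    return fooled / len(detector_labels)

# Example: the detector still flags 2 of 10 attacked texts as AI.
labels = ["human"] * 8 + ["ai"] * 2
print(attack_success_rate(labels))  # 0.8
```

Under this definition, the reported 92% average ASR means roughly 9 in 10 style-humanized texts slip past the evaluated detectors.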


Key Contributions

  • MASH framework combining Style-injection SFT, DPO alignment, and inference-time refinement to evade black-box AIGT detectors via style transfer
  • Demonstrates that a 0.1B parameter model optimized via MASH can outperform much larger models in evasion performance with limited detector interactions
  • Achieves 92% average ASR across 6 datasets and 5 detectors, surpassing strongest baselines by 24% while preserving linguistic quality
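The final stage of the pipeline above, inference-time refinement, can be pictured as a candidate-selection loop against a black-box detector. The sketch below is an assumption-laden illustration, not the paper's implementation: `rewrite_candidates` and `detector_score` are hypothetical stand-ins for the style-humanization model and the black-box detector's AI-probability output.

```python
# Hedged sketch of a generic inference-time refinement loop, in the
# spirit of MASH's final stage: generate several style-humanized
# rewrites and keep the one the black-box detector scores as least
# likely to be AI-generated. Both helpers below are toy stand-ins.

def detector_score(text):
    """Stand-in black-box detector: returns a mock P(AI-generated)."""
    return 0.2 if "honestly" in text else 0.9

def rewrite_candidates(text, fillers=("honestly, ", "to be fair, ", "in short, ")):
    """Stand-in humanizer: produce stylistic variants of the input."""
    return [filler + text for filler in fillers]

def refine(text):
    """Pick the candidate the detector is least confident about."""
    return min(rewrite_candidates(text), key=detector_score)
```

The key property this loop illustrates is that refinement needs only detector scores, not gradients or internals, which is what makes the attack viable in a black-box setting with limited interactions.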

🛡️ Threat Analysis

Output Integrity Attack

The paper directly attacks AI-generated content detection systems, which serve as output integrity verification mechanisms, by training a style-transfer model to make AI-generated text indistinguishable from human-written text, causing AIGT detectors to misclassify it. OWASP ML09 explicitly covers AI-generated content detection and attacks on such systems.


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
black_box · inference_time · targeted
Datasets
6 domain datasets (names not fully specified in excerpt)
Applications
ai-generated text detection · text authenticity verification