Benchmark · 2026

MAGA-Bench: Machine-Augment-Generated Text via Alignment Detection Benchmark

Anyang Song, Ying Cheng, Yiqian Xu, Rui Feng

0 citations · 62 references · arXiv


Published on arXiv · 2601.04633

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

The MAGA dataset caused an average 8.13% AUC decrease across selected MGT detectors, while a RoBERTa detector fine-tuned on MAGA training data improved generalization detection AUC by 4.60%.

RLDF (Reinforced Learning from Detectors Feedback)

Novel technique introduced


The alignment of Large Language Models (LLMs) is constantly evolving, and Machine-Generated Text (MGT) is becoming increasingly difficult to distinguish from Human-Written Text (HWT). This has exacerbated abuse such as fake news and online fraud. The generalization ability of fine-tuned detectors depends heavily on dataset quality, and simply expanding the sources of MGT is insufficient; the generation process itself must be further augmented. According to HC-Var's theory, enhancing the alignment of generated text can both facilitate attacks on existing detectors to test their robustness and help improve the generalization ability of detectors fine-tuned on it. We therefore propose Machine-Augment-Generated Text via Alignment (MAGA). The MAGA pipeline achieves comprehensive alignment from prompt construction to the reasoning process, with Reinforced Learning from Detectors Feedback (RLDF), which we systematically propose, as a key component. In our experiments, a RoBERTa detector fine-tuned on the MAGA training set achieved an average improvement of 4.60% in generalization detection AUC, and the MAGA dataset caused an average decrease of 8.13% in the AUC of the selected detectors, which we expect to be indicative for future research on the generalization ability of detectors.


Key Contributions

  • MAGA pipeline that augments LLM text generation via alignment (role-playing, BPO, self-refine, RLDF) to produce MGT closer to human-written text
  • RLDF (Reinforced Learning from Detectors Feedback), a novel RL-based method that systematically trains generators to evade MGT detectors
  • MAGA-Bench dataset covering 20 domains, 12 generators, and 936k entries, causing an average 8.13% AUC decrease in existing detectors while improving RoBERTa generalization AUC by 4.60%
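The RLDF component above can be illustrated with a minimal reward-shaping sketch. This is an assumption-laden illustration, not the paper's implementation: the function names (`detector_feedback_reward`, `ensemble_reward`) and the exact reward mapping are hypothetical. The core idea RLDF names is that frozen MGT detectors score each generated sample, and the generator is rewarded when detectors judge its output as human-written, so an RL step (e.g. PPO) can push generation toward lower detectability.

```python
# Hedged sketch of an RLDF-style reward signal. Interfaces are hypothetical;
# the paper's actual training objective and detector set are not reproduced here.

def detector_feedback_reward(detector_scores):
    """Map a detector's P(machine-generated) scores to rewards.

    Lower detectability -> higher reward, so an RL optimizer (e.g. PPO)
    pushes the generator toward text the detector labels as human-written.
    """
    return [1.0 - p for p in detector_scores]


def ensemble_reward(per_detector_scores):
    """Average rewards across several frozen detectors.

    Using an ensemble (rather than one detector) reduces the chance the
    generator overfits to a single detector's decision boundary.
    per_detector_scores: list of per-detector score lists, one score per sample.
    """
    rewards = [detector_feedback_reward(scores) for scores in per_detector_scores]
    n_detectors = len(rewards)
    n_samples = len(rewards[0])
    return [sum(r[i] for r in rewards) / n_detectors for i in range(n_samples)]
```

In a full loop, these rewards would feed a standard RL-fine-tuning step for the generator, with detector scores recomputed each batch on freshly sampled text.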

🛡️ Threat Analysis

Output Integrity Attack

The core focus is AI-generated content detection: the paper both attacks existing MGT detectors (an 8.13% AUC drop) using alignment-enhanced text and improves detector generalization, directly addressing the output-integrity problem of distinguishing AI-generated from human-written text. RLDF specifically optimizes generation to evade content-authenticity detectors.


Details

Domains
NLP
Model Types
LLM, Transformer
Threat Tags
black_box, inference_time
Datasets
MAGA, MGB, M4, RAID, Reddit, S2ORC, Wikipedia, CC News
Applications
machine-generated text detection, AI-generated content detection