Benchmark · 2026

MAGA-Bench: Machine-Augment-Generated Text via Alignment Detection Benchmark

Anyang Song, Ying Cheng, Yiqian Xu, Rui Feng

0 citations · 62 references · arXiv


Published on arXiv · 2601.04633

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

The MAGA dataset caused an average 8.13% AUC decrease across selected MGT detectors, while a RoBERTa detector fine-tuned on MAGA training data improved generalization detection AUC by 4.60%.

RLDF (Reinforced Learning from Detectors Feedback)

Novel technique introduced


The alignment of Large Language Models (LLMs) is constantly evolving, and Machine-Generated Text (MGT) is becoming increasingly difficult to distinguish from Human-Written Text (HWT). This has exacerbated abuse such as fake news and online fraud. The generalization ability of fine-tuned detectors depends heavily on dataset quality, and simply expanding the sources of MGT is insufficient; the generation process itself must be further augmented. According to HC-Var's theory, enhancing the alignment of generated text can both facilitate attacks on existing detectors to test their robustness and help improve the generalization ability of detectors fine-tuned on it. We therefore propose Machine-Augment-Generated Text via Alignment (MAGA). The MAGA pipeline achieves comprehensive alignment from prompt construction to the reasoning process, with Reinforced Learning from Detectors Feedback (RLDF), which we systematically propose, as a key component. In our experiments, a RoBERTa detector fine-tuned on the MAGA training set achieved an average improvement of 4.60% in generalization detection AUC, and the MAGA dataset caused an average decrease of 8.13% in the AUC of the selected detectors, which we expect to be indicative for future research on the generalization ability of detectors.


Key Contributions

  • MAGA pipeline that augments LLM text generation via alignment (role-playing, BPO, self-refine, RLDF) to produce MGT closer to human-written text
  • RLDF (Reinforced Learning from Detectors Feedback), a novel RL-based method that systematically trains generators to evade MGT detectors
  • MAGA-Bench dataset covering 20 domains, 12 generators, and 936k entries, causing an average 8.13% AUC decrease in existing detectors while improving RoBERTa generalization AUC by 4.60%
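The RLDF component above can be illustrated with a minimal reward-shaping sketch. This is an assumption-laden illustration, not the paper's implementation: the function names (`detector_feedback_reward`, `ensemble_reward`) and the exact reward mapping are hypothetical. The core idea RLDF names is that frozen MGT detectors score each generated sample, and the generator is rewarded when detectors judge its output as human-written, so an RL step (e.g. PPO) can push generation toward lower detectability.

```python
# Hedged sketch of an RLDF-style reward signal. Interfaces are hypothetical;
# the paper's actual training objective and detector set are not reproduced here.

def detector_feedback_reward(detector_scores):
    """Map a detector's P(machine-generated) scores to rewards.

    Lower detectability -> higher reward, so an RL optimizer (e.g. PPO)
    pushes the generator toward text the detector labels as human-written.
    """
    return [1.0 - p for p in detector_scores]


def ensemble_reward(per_detector_scores):
    """Average rewards across several frozen detectors.

    Using an ensemble (rather than one detector) reduces the chance the
    generator overfits to a single detector's decision boundary.
    per_detector_scores: list of per-detector score lists, one score per sample.
    """
    rewards = [detector_feedback_reward(scores) for scores in per_detector_scores]
    n_detectors = len(rewards)
    n_samples = len(rewards[0])
    return [sum(r[i] for r in rewards) / n_detectors for i in range(n_samples)]
```

In a full loop, these rewards would feed a standard RL-fine-tuning step for the generator, with detector scores recomputed each batch on freshly sampled text.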

🛡️ Threat Analysis

Output Integrity Attack

The core focus is AI-generated content detection: the paper both attacks existing MGT detectors (an 8.13% AUC drop) using alignment-enhanced text and improves detector generalization, directly addressing the output-integrity problem of distinguishing AI-generated from human-written text. RLDF specifically optimizes generation to evade content-authenticity detectors.


Details

Domains
NLP
Model Types
LLM, Transformer
Threat Tags
black_box, inference_time
Datasets
MAGA, MGB, M4, RAID, Reddit, S2ORC, Wikipedia, CC News
Applications
machine-generated text detection, AI-generated content detection