
Detecting the Machine: A Comprehensive Benchmark of AI-Generated Text Detectors Across Architectures, Domains, and Adversarial Conditions

Madhav S. Baidya 1, S. S. Baidya 2, Chirag Chawla 1



Published on arXiv: 2603.17522

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Fine-tuned transformers achieve ≥0.994 AUROC in-distribution but degrade universally under domain shift; an XGBoost stylometric model matches transformer performance while remaining interpretable; no detector generalizes robustly across both LLM sources and domains


The rapid proliferation of large language models (LLMs) has created an urgent need for robust and generalizable detectors of machine-generated text. Existing benchmarks typically evaluate a single detector on a single dataset under ideal conditions, leaving open questions about cross-domain transfer, cross-LLM generalization, and adversarial robustness. We present a comprehensive benchmark evaluating diverse detection approaches across two corpora: HC3 (23,363 human-ChatGPT pairs) and ELI5 (15,000 human-Mistral-7B pairs). Methods include classical classifiers, fine-tuned transformer encoders (BERT, RoBERTa, ELECTRA, DistilBERT, DeBERTa-v3), a CNN, an XGBoost stylometric model, perplexity-based detectors, and LLM-as-detector prompting. Results show that transformer models achieve near-perfect in-distribution performance but degrade under domain shift. The XGBoost stylometric model matches this performance while remaining interpretable. LLM-based detectors underperform and are affected by generator-detector identity bias. Perplexity-based methods exhibit polarity inversion, with modern LLM outputs showing lower perplexity than human text, but remain effective when corrected. No method generalizes robustly across domains and LLM sources.
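The stylometric approach mentioned above feeds hand-crafted, interpretable text statistics into a gradient-boosted classifier. A minimal sketch of the kind of features involved (the paper's exact feature set is not given here; lexical diversity, sentence length, and punctuation rate are illustrative assumptions):

```python
import re

def stylometric_features(text: str) -> dict:
    """Illustrative stylometric features of the sort an XGBoost
    classifier can consume. These are hypothetical examples, not the
    paper's actual feature set."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)
    return {
        # lexical diversity: unique words / total words
        "type_token_ratio": len(set(words)) / n_words,
        # average words per sentence
        "avg_sentence_len": n_words / max(len(sentences), 1),
        # mid-sentence punctuation rate per word
        "punct_per_word": sum(text.count(c) for c in ",;:") / n_words,
    }

feats = stylometric_features("I think, honestly, it depends. Who knows?")
print(feats)
```

Each feature remains human-readable, which is what gives the XGBoost model its interpretability advantage over transformer detectors: feature-importance scores point to concrete stylistic signals rather than opaque embeddings.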


Key Contributions

  • Comprehensive benchmark of six detector families (classical classifiers, five fine-tuned transformers, a CNN, an XGBoost stylometric model, perplexity-based methods, and LLM-as-detector prompting) evaluated on 76,726 samples across the HC3 and ELI5 datasets with length-matching preprocessing
  • Cross-LLM generalization evaluation showing universal degradation when the test-time generator differs from the training generator, plus adversarial humanization testing at three rewriting intensities
  • Discovery of perplexity polarity inversion — modern LLM outputs have lower perplexity than human text, contradicting traditional detection assumptions
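The polarity inversion in the last contribution can be sketched as a toy decision rule. Traditional detectors flagged *high*-perplexity text as machine-generated; with modern LLMs the sign flips, because their output is more predictable than human writing. The threshold and token log-probabilities below are illustrative assumptions, not values from the paper:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(negative mean token log-probability)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def classify(ppl: float, threshold: float = 20.0) -> str:
    """Polarity-corrected rule (hypothetical threshold): LOW perplexity
    now signals the machine, inverting the traditional assumption."""
    return "ai" if ppl < threshold else "human"

# Toy log-probs: fluent, low-surprise text vs. burstier human text.
ai_like = [-1.2, -0.8, -1.0, -0.9, -1.1]
human_like = [-3.5, -2.5, -4.2, -2.8, -3.9]
print(classify(perplexity(ai_like)))     # low perplexity -> "ai"
print(classify(perplexity(human_like)))  # high perplexity -> "human"
```

The correction the abstract refers to is exactly this kind of flipped threshold: the perplexity signal remains informative, but only once its direction is inverted for modern generators.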

🛡️ Threat Analysis

Output Integrity Attack

The paper evaluates detection of AI-generated text across multiple architectures (transformers, CNN, XGBoost, perplexity-based, LLM-as-detector) and tests robustness to adversarial humanization attacks. This is output integrity — verifying whether text was produced by an AI model. It also includes adversarial rewriting attacks that defeat detectors (L1/L2 humanization), which are ML09 attacks on content authenticity systems.


Details

Domains
nlp
Model Types
llm, transformer, cnn, traditional_ml
Threat Tags
black_box, inference_time
Datasets
HC3, ELI5
Applications
ai-generated text detection, academic integrity, content authenticity verification