
Detecting the Machine: A Comprehensive Benchmark of AI-Generated Text Detectors Across Architectures, Domains, and Adversarial Conditions

Madhav S. Baidya 1, S. S. Baidya 2, Chirag Chawla 1



Published on arXiv: 2603.17522

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Fine-tuned transformers achieve ≥0.994 AUROC in-distribution but degrade universally under domain shift; an XGBoost stylometric model matches transformer performance while remaining interpretable; no detector generalizes robustly across both LLM sources and domains


The rapid proliferation of large language models (LLMs) has created an urgent need for robust and generalizable detectors of machine-generated text. Existing benchmarks typically evaluate a single detector on a single dataset under ideal conditions, leaving open questions about cross-domain transfer, cross-LLM generalization, and adversarial robustness. We present a comprehensive benchmark evaluating diverse detection approaches across two corpora: HC3 (23,363 human-ChatGPT pairs) and ELI5 (15,000 human-Mistral-7B pairs). Methods include classical classifiers, fine-tuned transformer encoders (BERT, RoBERTa, ELECTRA, DistilBERT, DeBERTa-v3), a CNN, an XGBoost stylometric model, perplexity-based detectors, and LLM-as-detector prompting. Results show that transformer models achieve near-perfect in-distribution performance but degrade under domain shift. The XGBoost stylometric model matches this performance while remaining interpretable. LLM-based detectors underperform and are affected by generator-detector identity bias. Perplexity-based methods exhibit polarity inversion, with modern LLM outputs showing lower perplexity than human text, but remain effective when corrected. No method generalizes robustly across domains and LLM sources.
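The stylometric approach mentioned above feeds hand-crafted, interpretable text statistics into a gradient-boosted classifier. A minimal sketch of the kind of features involved (the paper's exact feature set is not given here; lexical diversity, sentence length, and punctuation rate are illustrative assumptions):

```python
import re

def stylometric_features(text: str) -> dict:
    """Illustrative stylometric features of the sort an XGBoost
    classifier can consume. These are hypothetical examples, not the
    paper's actual feature set."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)
    return {
        # lexical diversity: unique words / total words
        "type_token_ratio": len(set(words)) / n_words,
        # average words per sentence
        "avg_sentence_len": n_words / max(len(sentences), 1),
        # mid-sentence punctuation rate per word
        "punct_per_word": sum(text.count(c) for c in ",;:") / n_words,
    }

feats = stylometric_features("I think, honestly, it depends. Who knows?")
print(feats)
```

Each feature remains human-readable, which is what gives the XGBoost model its interpretability advantage over transformer detectors: feature-importance scores point to concrete stylistic signals rather than opaque embeddings.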


Key Contributions

  • Comprehensive benchmark of six detector families (classical classifiers, five fine-tuned transformers, a CNN, an XGBoost stylometric model, perplexity-based methods, and LLM-as-detector prompting) evaluated on 76,726 samples across the HC3 and ELI5 datasets with length-matching preprocessing
  • Cross-LLM generalization evaluation showing universal degradation when the test-time generator differs from the training generator, plus adversarial humanization testing at three rewriting intensities
  • Discovery of perplexity polarity inversion — modern LLM outputs have lower perplexity than human text, contradicting traditional detection assumptions
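The polarity inversion in the last contribution can be sketched as a toy decision rule. Traditional detectors flagged *high*-perplexity text as machine-generated; with modern LLMs the sign flips, because their output is more predictable than human writing. The threshold and token log-probabilities below are illustrative assumptions, not values from the paper:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(negative mean token log-probability)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def classify(ppl: float, threshold: float = 20.0) -> str:
    """Polarity-corrected rule (hypothetical threshold): LOW perplexity
    now signals the machine, inverting the traditional assumption."""
    return "ai" if ppl < threshold else "human"

# Toy log-probs: fluent, low-surprise text vs. burstier human text.
ai_like = [-1.2, -0.8, -1.0, -0.9, -1.1]
human_like = [-3.5, -2.5, -4.2, -2.8, -3.9]
print(classify(perplexity(ai_like)))     # low perplexity -> "ai"
print(classify(perplexity(human_like)))  # high perplexity -> "human"
```

The correction the abstract refers to is exactly this kind of flipped threshold: the perplexity signal remains informative, but only once its direction is inverted for modern generators.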

🛡️ Threat Analysis

Output Integrity Attack

The paper evaluates detection of AI-generated text across multiple architectures (transformers, CNN, XGBoost, perplexity-based, LLM-as-detector) and tests robustness to adversarial humanization attacks. This is output integrity — verifying whether text was produced by an AI model. It also includes adversarial rewriting attacks that defeat detectors (L1/L2 humanization), which are ML09 attacks on content authenticity systems.


Details

Domains
nlp
Model Types
llm, transformer, cnn, traditional_ml
Threat Tags
black_box, inference_time
Datasets
HC3, ELI5
Applications
ai-generated text detection, academic integrity, content authenticity verification