benchmark 2025

DACTYL: Diverse Adversarial Corpus of Texts Yielded from Large Language Models

Shantanu Thorat, Andrew Caines

0 citations


Published on arXiv: 2508.00619

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

DXO-trained classifiers outperform BCE-trained classifiers by 50.56 macro-F1 points on OOD student essay detection at the lowest false positive rates, while existing detectors broadly fail on one-shot/CPT-generated texts.

DACTYL

Novel technique introduced


Existing AIG (AI-generated) text detectors struggle in real-world settings despite succeeding in internal testing, suggesting that they may not be robust enough. To address this, we rigorously examine the machine-learning procedure used to build these detectors. Most current AIG text detection datasets focus on zero-shot generations, but little work has been done on few-shot or one-shot generations, where LLMs are given human texts as examples. In response, we introduce the Diverse Adversarial Corpus of Texts Yielded from Language models (DACTYL), a challenging AIG text detection dataset focused on one-shot/few-shot generations. We also include texts from domain-specific continued-pre-trained (CPT) language models, for which we train all parameters using a memory-efficient optimization approach. Many existing AIG text detectors struggle significantly on our dataset, indicating a potential vulnerability to one-shot/few-shot and CPT-generated texts. We also train our own classifiers using two approaches: standard binary cross-entropy (BCE) optimization and a more recent approach, deep X-risk optimization (DXO). While BCE-trained classifiers marginally outperform DXO-trained classifiers on the DACTYL test set, the latter excel on out-of-distribution (OOD) texts. In a mock deployment scenario for student essay detection with an OOD student essay dataset, the best DXO-trained classifier outscored the best BCE-trained classifier by 50.56 macro-F1 points at each classifier's lowest false positive rate. Our results indicate that DXO-trained classifiers generalize better without overfitting to the test set. Our experiments highlight several areas of improvement for AIG text detectors.


Key Contributions

  • DACTYL: a challenging AIG text detection dataset focused on one-shot/few-shot and continued pre-trained (CPT) model generations, exposing a significant gap in existing benchmarks
  • Empirical demonstration that most existing AIG text detectors fail significantly on DACTYL, revealing a real-world vulnerability to one-shot/few-shot and CPT-generated texts
  • Comparison of BCE vs. DXO optimization for training AIG classifiers, showing DXO generalizes substantially better out-of-distribution (e.g., +50.56 macro-F1 on student essay detection at low FPR)
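The card contrasts BCE with deep X-risk optimization (DXO) without spelling out either objective. As a rough illustration only (not the paper's implementation), the sketch below contrasts a per-example BCE loss with a pairwise squared-hinge surrogate that concentrates on the hardest negatives per positive, the flavor of ranking/partial-AUC objective that X-risk methods optimize. The function names, the margin, and the top-k heuristic are our own assumptions for the sketch.

```python
import math

def bce_loss(scores, labels):
    """Standard binary cross-entropy on sigmoid-transformed scores.
    Each example contributes independently, regardless of class balance."""
    total = 0.0
    for s, y in zip(scores, labels):
        p = 1.0 / (1.0 + math.exp(-s))
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(scores)

def pauc_surrogate(scores, labels, margin=1.0, k=2):
    """Illustrative X-risk-style surrogate: a squared hinge on
    positive/negative score pairs, averaging only the k hardest
    negatives per positive (a crude stand-in for the partial-AUC
    losses DXO-style training optimizes)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    per_pair = []
    for sp in pos:
        hinges = sorted(max(0.0, margin - (sp - sn)) ** 2 for sn in neg)
        per_pair.extend(hinges[-k:])  # keep only the hardest negatives
    return sum(per_pair) / len(per_pair)
```

Intuitively, the pairwise objective only cares about how positives rank against negatives near the decision boundary, which is one plausible reason a DXO-trained classifier would hold up better at very low false positive rates on OOD data than one trained example-by-example with BCE.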

🛡️ Threat Analysis

Output Integrity Attack

Directly addresses AI-generated text detection — the primary contribution is a challenging benchmark (DACTYL) that exposes vulnerabilities in existing AIG text detectors, particularly against one-shot/few-shot and continued pre-trained model outputs. This is content provenance and output integrity research, not a domain application of an existing detector.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time
Datasets
DACTYL, MAGE, RAID, student essay OOD dataset
Applications
ai-generated text detection, student essay detection, academic integrity