DACTYL: Diverse Adversarial Corpus of Texts Yielded from Large Language Models
Shantanu Thorat, Andrew Caines
Published on arXiv
2508.00619
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
DXO-trained classifiers outperform BCE-trained classifiers by 50.56 macro-F1 points on OOD student essay detection at the lowest false positive rates, while existing detectors broadly fail on one-shot/CPT-generated texts.
DACTYL
Novel technique introduced
Existing AIG (AI-generated) text detectors struggle in real-world settings despite succeeding in internal testing, suggesting that they may not be robust enough. To address this, we rigorously examine the machine-learning procedure used to build these detectors. Most current AIG text detection datasets focus on zero-shot generations, but little work has been done on one-shot or few-shot generations, where LLMs are given human texts as examples. In response, we introduce the Diverse Adversarial Corpus of Texts Yielded from Language models (DACTYL), a challenging AIG text detection dataset focusing on one-shot/few-shot generations. We also include texts from domain-specific continued-pre-trained (CPT) language models, for which we fully train all parameters using a memory-efficient optimization approach. Many existing AIG text detectors struggle significantly on our dataset, indicating a potential vulnerability to one-shot/few-shot and CPT-generated texts. We also train our own classifiers using two approaches: standard binary cross-entropy (BCE) optimization and a more recent approach, deep X-risk optimization (DXO). While BCE-trained classifiers marginally outperform DXO-trained classifiers on the DACTYL test set, the latter excel on out-of-distribution (OOD) texts. In a mock deployment scenario for student essay detection with an OOD student essay dataset, the best DXO classifier outscored the best BCE-trained classifier by 50.56 macro-F1 points at the lowest false positive rate for each. Our results indicate that DXO classifiers generalize better without overfitting to the test set. Our experiments highlight several areas of improvement for AIG text detectors.
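To make the BCE vs. DXO contrast concrete: BCE scores each example independently, while X-risk objectives are compositional, coupling scores across examples (e.g., AUC-style losses over positive/negative pairs). The sketch below is an illustrative assumption, not the paper's exact objective or implementation (the authors use the DXO framework; a pairwise squared-hinge surrogate is shown here only as a minimal member of the X-risk family).

```python
import numpy as np

def bce_loss(scores, labels):
    """Standard binary cross-entropy on sigmoid-transformed scores.
    Each example contributes independently to the loss."""
    p = 1.0 / (1.0 + np.exp(-scores))
    eps = 1e-12
    return float(-np.mean(labels * np.log(p + eps)
                          + (1 - labels) * np.log(1 - p + eps)))

def pairwise_xrisk_loss(scores, labels, margin=1.0):
    """Illustrative X-risk-style objective: a squared-hinge surrogate
    over all (AI, human) score pairs. The loss depends on score *gaps*
    between classes rather than on per-example probabilities, which is
    the structural difference from BCE."""
    pos = scores[labels == 1]          # AI-generated texts
    neg = scores[labels == 0]          # human texts
    diffs = pos[:, None] - neg[None, :]  # every positive-negative gap
    return float(np.mean(np.maximum(0.0, margin - diffs) ** 2))

# Toy check: well-separated scores satisfy every pairwise margin,
# so the X-risk surrogate vanishes while BCE remains small but positive.
scores = np.array([5.0, 4.0, -4.0, -5.0])
labels = np.array([1, 1, 0, 0])
print(bce_loss(scores, labels), pairwise_xrisk_loss(scores, labels))
```

The pairwise form is what lets such objectives target ranking-style metrics (AUC, partial AUC at low FPR) directly, which is the intuition behind DXO classifiers holding up better in the low-false-positive regime.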
Key Contributions
- DACTYL: a challenging AIG text detection dataset focused on one-shot/few-shot and continued pre-trained (CPT) model generations, exposing a significant gap in existing benchmarks
- Empirical demonstration that most existing AIG text detectors fail significantly on DACTYL, revealing a real-world vulnerability to one-shot/few-shot and CPT-generated texts
- Comparison of BCE vs. DXO optimization for training AIG classifiers, showing DXO generalizes substantially better out-of-distribution (e.g., +50.56 macro-F1 on student essay detection at low FPR)
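The headline metric above is macro-F1 evaluated at the lowest false positive rates. A minimal sketch of that evaluation protocol might look like the following; the thresholding rule (cap the FPR on the human class, then score both classes) and all function names are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def threshold_at_fpr(scores, labels, target_fpr):
    """Pick the lowest decision threshold whose false positive rate on
    human texts (label 0) does not exceed target_fpr."""
    neg = np.sort(scores[labels == 0])
    k = int(np.floor(target_fpr * len(neg)))   # negatives allowed above threshold
    return neg[len(neg) - k - 1] if k < len(neg) else -np.inf

def macro_f1(preds, labels):
    """Unweighted mean of per-class F1 over {human, AI}, so the human
    class counts equally even if AI texts dominate the test set."""
    f1s = []
    for c in (0, 1):
        tp = np.sum((preds == c) & (labels == c))
        fp = np.sum((preds == c) & (labels != c))
        fn = np.sum((preds != c) & (labels == c))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(f1s))

# Toy example: threshold chosen so that zero human texts are flagged.
scores = np.array([0.9, 0.8, 0.7, 0.3, 0.2, 0.1])
labels = np.array([1, 1, 1, 0, 0, 0])
thr = threshold_at_fpr(scores, labels, target_fpr=0.0)
preds = (scores > thr).astype(int)
print(macro_f1(preds, labels))  # 1.0 on this perfectly separable toy set
```

Evaluating at a capped FPR reflects the deployment constraint in student essay detection, where falsely accusing a human writer is far more costly than missing an AI-generated text.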
🛡️ Threat Analysis
Directly addresses AI-generated text detection — the primary contribution is a challenging benchmark (DACTYL) that exposes vulnerabilities in existing AIG text detectors, particularly against one-shot/few-shot and continued pre-trained model outputs. This is content provenance and output integrity research, not a domain application of an existing detector.