benchmark 2025

DACTYL: Diverse Adversarial Corpus of Texts Yielded from Large Language Models

Shantanu Thorat, Andrew Caines

0 citations


Published on arXiv: 2508.00619

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

DXO-trained classifiers outperform BCE-trained classifiers by 50.56 macro-F1 points on OOD student essay detection at the lowest false positive rates, while existing detectors broadly fail on one-shot/CPT-generated texts.

DACTYL

Novel technique introduced


Existing AIG (AI-generated) text detectors struggle in real-world settings despite succeeding in internal testing, suggesting that they may not be robust enough. To address this, we rigorously examine the machine-learning procedure used to build these detectors. Most current AIG text detection datasets focus on zero-shot generations, but little work has been done on few-shot or one-shot generations, where LLMs are given human texts as examples. In response, we introduce the Diverse Adversarial Corpus of Texts Yielded from Language models (DACTYL), a challenging AIG text detection dataset focused on one-shot/few-shot generations. We also include texts from domain-specific continued-pre-trained (CPT) language models, for which we train all parameters using a memory-efficient optimization approach. Many existing AIG text detectors struggle significantly on our dataset, indicating a potential vulnerability to one-shot/few-shot and CPT-generated texts. We also train our own classifiers using two approaches: standard binary cross-entropy (BCE) optimization and a more recent approach, deep X-risk optimization (DXO). While BCE-trained classifiers marginally outperform DXO-trained classifiers on the DACTYL test set, the latter excel on out-of-distribution (OOD) texts. In a mock deployment scenario for student essay detection with an OOD student essay dataset, the best DXO-trained classifier outscored the best BCE-trained classifier by 50.56 macro-F1 points at each classifier's lowest false positive rate. Our results indicate that DXO-trained classifiers generalize better without overfitting to the test set. Our experiments highlight several areas of improvement for AIG text detectors.


Key Contributions

  • DACTYL: a challenging AIG text detection dataset focused on one-shot/few-shot and continued pre-trained (CPT) model generations, exposing a significant gap in existing benchmarks
  • Empirical demonstration that most existing AIG text detectors fail significantly on DACTYL, revealing a real-world vulnerability to one-shot/few-shot and CPT-generated texts
  • Comparison of BCE vs. DXO optimization for training AIG classifiers, showing DXO generalizes substantially better out-of-distribution (e.g., +50.56 macro-F1 on student essay detection at low FPR)
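The card contrasts BCE with deep X-risk optimization (DXO) without spelling out either objective. As a rough illustration only (not the paper's implementation), the sketch below contrasts a per-example BCE loss with a pairwise squared-hinge surrogate that concentrates on the hardest negatives per positive, the flavor of ranking/partial-AUC objective that X-risk methods optimize. The function names, the margin, and the top-k heuristic are our own assumptions for the sketch.

```python
import math

def bce_loss(scores, labels):
    """Standard binary cross-entropy on sigmoid-transformed scores.
    Each example contributes independently, regardless of class balance."""
    total = 0.0
    for s, y in zip(scores, labels):
        p = 1.0 / (1.0 + math.exp(-s))
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(scores)

def pauc_surrogate(scores, labels, margin=1.0, k=2):
    """Illustrative X-risk-style surrogate: a squared hinge on
    positive/negative score pairs, averaging only the k hardest
    negatives per positive (a crude stand-in for the partial-AUC
    losses DXO-style training optimizes)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    per_pair = []
    for sp in pos:
        hinges = sorted(max(0.0, margin - (sp - sn)) ** 2 for sn in neg)
        per_pair.extend(hinges[-k:])  # keep only the hardest negatives
    return sum(per_pair) / len(per_pair)
```

Intuitively, the pairwise objective only cares about how positives rank against negatives near the decision boundary, which is one plausible reason a DXO-trained classifier would hold up better at very low false positive rates on OOD data than one trained example-by-example with BCE.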

🛡️ Threat Analysis

Output Integrity Attack

Directly addresses AI-generated text detection — the primary contribution is a challenging benchmark (DACTYL) that exposes vulnerabilities in existing AIG text detectors, particularly against one-shot/few-shot and continued pre-trained model outputs. This is content provenance and output integrity research, not a domain application of an existing detector.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time
Datasets
DACTYL, MAGE, RAID, student essay OOD dataset
Applications
ai-generated text detection, student essay detection, academic integrity