Can We Trust LLM Detectors?
Jivnesh Sandhan 1, Harshit Jaiswal 2, Fei Cheng 1, Yugo Murawaki 1
Published on arXiv: 2601.15301
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Proposed SCL framework achieves 95.98% accuracy with 100% precision on RAID benchmark, but all detectors — including the proposed method — degrade sharply out-of-domain, confirming no universal detector is achievable with current approaches.
Supervised Contrastive Learning (SCL) for AI text detection
Novel technique introduced
The rapid adoption of LLMs has increased the need for reliable AI text detection, yet existing detectors often fail outside controlled benchmarks. We systematically evaluate the two dominant paradigms (training-free and supervised) and show that both are brittle under distribution shift, unseen generators, and simple stylistic perturbations. To address these limitations, we propose a supervised contrastive learning (SCL) framework that learns discriminative style embeddings. Experiments show that while supervised detectors excel in-domain, they degrade sharply out-of-domain, and training-free methods remain highly sensitive to proxy choice. Overall, our results expose fundamental challenges in building domain-agnostic detectors. Our code is available at: https://github.com/HARSHITJAIS14/DetectAI
Key Contributions
- Systematic evaluation showing both training-free and supervised AI text detectors fail severely under distribution shift and unseen generators
- Supervised contrastive learning (SCL) framework using DeBERTa-v3 with InfoNCE loss that learns discriminative style embeddings and enables few-shot adaptation with as few as 25 examples
- Comprehensive adversarial and OOD robustness analysis demonstrating that no current paradigm achieves domain-agnostic detection
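The contributions above center on a supervised contrastive (InfoNCE-style) objective over style embeddings. The summary does not reproduce the authors' implementation, so the following is a minimal NumPy sketch of a standard supervised contrastive loss (in the style of SupCon, Khosla et al. 2020) applied to L2-normalized embeddings; the function name, temperature value, and toy batch are illustrative assumptions, not taken from the DetectAI codebase.

```python
import numpy as np

def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive (InfoNCE-style) loss over a batch.

    embeddings: (N, D) style embeddings (e.g. DeBERTa-v3 pooled outputs)
    labels:     (N,) integer class labels (e.g. 0 = human, 1 = AI)
    Assumes every class in the batch has at least two examples.
    """
    # L2-normalize so the dot product is cosine similarity
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature

    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    # exclude each anchor's similarity to itself from the softmax
    sim = np.where(self_mask, -np.inf, sim)

    # numerically stable row-wise log-softmax
    row_max = sim.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_denom

    # positives: same label, different example
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    # mean log-probability over positives for each anchor, then negate
    per_anchor = -np.where(pos_mask, log_prob, 0.0).sum(axis=1) / pos_mask.sum(axis=1)
    return per_anchor.mean()

# Toy batch: two "human" and two "AI" embeddings that cluster by label
labels = np.array([0, 0, 1, 1])
batch = np.array([[1.0, 0.0], [1.0, 0.05], [0.0, 1.0], [0.05, 1.0]])
loss = supcon_loss(batch, labels)
```

The loss pulls same-class embeddings together and pushes different-class embeddings apart, which is what lets the resulting style space support few-shot adaptation from a handful of labeled examples.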
🛡️ Threat Analysis
The core contribution is detecting AI-generated text (output integrity/authenticity): the paper both evaluates existing detectors and proposes a novel SCL-based detection architecture. AI-generated text detection is a canonical ML09 task.