Can We Trust LLM Detectors?
Jivnesh Sandhan 1, Harshit Jaiswal 2, Fei Cheng 1, Yugo Murawaki 1
Published on arXiv: 2601.15301
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Proposed SCL framework achieves 95.98% accuracy with 100% precision on RAID benchmark, but all detectors — including the proposed method — degrade sharply out-of-domain, confirming no universal detector is achievable with current approaches.
Supervised Contrastive Learning (SCL) for AI text detection
Novel technique introduced
The rapid adoption of LLMs has increased the need for reliable AI text detection, yet existing detectors often fail outside controlled benchmarks. We systematically evaluate the two dominant paradigms (training-free and supervised) and show that both are brittle under distribution shift, unseen generators, and simple stylistic perturbations. To address these limitations, we propose a supervised contrastive learning (SCL) framework that learns discriminative style embeddings. Experiments show that while supervised detectors excel in-domain, they degrade sharply out-of-domain, and training-free methods remain highly sensitive to proxy choice. Overall, our results expose fundamental challenges in building domain-agnostic detectors. Our code is available at: https://github.com/HARSHITJAIS14/DetectAI
Key Contributions
- Systematic evaluation showing both training-free and supervised AI text detectors fail severely under distribution shift and unseen generators
- Supervised contrastive learning (SCL) framework using DeBERTa-v3 with InfoNCE loss that learns discriminative style embeddings and enables few-shot adaptation with as few as 25 examples
- Comprehensive adversarial and OOD robustness analysis demonstrating that no current paradigm achieves domain-agnostic detection
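The contributions above center on a supervised contrastive (InfoNCE-style) objective over style embeddings. The summary does not reproduce the authors' implementation, so the following is a minimal NumPy sketch of a standard supervised contrastive loss (in the style of SupCon, Khosla et al. 2020) applied to L2-normalized embeddings; the function name, temperature value, and toy batch are illustrative assumptions, not taken from the DetectAI codebase.

```python
import numpy as np

def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive (InfoNCE-style) loss over a batch.

    embeddings: (N, D) style embeddings (e.g. DeBERTa-v3 pooled outputs)
    labels:     (N,) integer class labels (e.g. 0 = human, 1 = AI)
    Assumes every class in the batch has at least two examples.
    """
    # L2-normalize so the dot product is cosine similarity
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature

    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    # exclude each anchor's similarity to itself from the softmax
    sim = np.where(self_mask, -np.inf, sim)

    # numerically stable row-wise log-softmax
    row_max = sim.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_denom

    # positives: same label, different example
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    # mean log-probability over positives for each anchor, then negate
    per_anchor = -np.where(pos_mask, log_prob, 0.0).sum(axis=1) / pos_mask.sum(axis=1)
    return per_anchor.mean()

# Toy batch: two "human" and two "AI" embeddings that cluster by label
labels = np.array([0, 0, 1, 1])
batch = np.array([[1.0, 0.0], [1.0, 0.05], [0.0, 1.0], [0.05, 1.0]])
loss = supcon_loss(batch, labels)
```

The loss pulls same-class embeddings together and pushes different-class embeddings apart, which is what lets the resulting style space support few-shot adaptation from a handful of labeled examples.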
🛡️ Threat Analysis
The core contribution is detecting AI-generated text (output integrity/authenticity): the paper both evaluates existing detectors and proposes a novel SCL-based detection architecture. AI-generated text detection is a canonical ML09 task.