defense 2026

Can We Trust LLM Detectors?

Jivnesh Sandhan 1, Harshit Jaiswal 2, Fei Cheng 1, Yugo Murawaki 1

0 citations · 16 references · arXiv

Published on arXiv

2601.15301

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

The proposed SCL framework achieves 95.98% accuracy with 100% precision on the RAID benchmark, but all detectors, including the proposed method, degrade sharply out-of-domain, indicating that no universal detector is achievable with current approaches.

Supervised Contrastive Learning (SCL) for AI text detection

Novel technique introduced


The rapid adoption of LLMs has increased the need for reliable AI text detection, yet existing detectors often fail outside controlled benchmarks. We systematically evaluate two dominant paradigms (training-free and supervised) and show that both are brittle under distribution shift, unseen generators, and simple stylistic perturbations. To address these limitations, we propose a supervised contrastive learning (SCL) framework that learns discriminative style embeddings. Experiments show that while supervised detectors excel in-domain, they degrade sharply out-of-domain, and training-free methods remain highly sensitive to proxy choice. Overall, our results expose fundamental challenges in building domain-agnostic detectors. Our code is available at: https://github.com/HARSHITJAIS14/DetectAI


Key Contributions

  • Systematic evaluation showing both training-free and supervised AI text detectors fail severely under distribution shift and unseen generators
  • Supervised contrastive learning (SCL) framework using DeBERTa-v3 with InfoNCE loss that learns discriminative style embeddings and enables few-shot adaptation with as few as 25 examples
  • Comprehensive adversarial and OOD robustness analysis demonstrating that no current paradigm achieves domain-agnostic detection
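
The SCL contribution listed above can be illustrated with a minimal sketch of a supervised InfoNCE-style contrastive loss. This is not the authors' implementation: the summary does not state the exact loss form, temperature, or batch construction, so the function name `supervised_contrastive_loss`, the temperature of 0.1, and the toy 2-D vectors standing in for DeBERTa-v3 embeddings are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised InfoNCE-style loss (hypothetical helper, not the paper's code).

    For each anchor, samples sharing its label are positives and every other
    sample in the batch is a candidate in the softmax denominator.
    """
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no positive contributes nothing
        # Temperature-scaled cosine similarity to every other sample.
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        # Average negative log-probability over this anchor's positives.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


if __name__ == "__main__":
    # Toy embeddings (assumed): two clusters, labelled 0 = human, 1 = AI.
    toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print("labels match clusters:", supervised_contrastive_loss(toy, [0, 0, 1, 1]))
    print("labels cut across clusters:", supervised_contrastive_loss(toy, [0, 1, 0, 1]))
```

The loss is low when same-label embeddings sit close together and high when labels cut across clusters, which is the property a style-embedding encoder would be trained to satisfy. In a real setup the toy vectors would be replaced by encoder outputs and the loop by batched tensor operations.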

🛡️ Threat Analysis

Output Integrity Attack

The core contribution is AI-generated text detection (output integrity/authenticity): the paper both evaluates existing detectors and proposes a novel SCL-based detection architecture. AI-generated text detection is a canonical ML09 task.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time
Datasets
CHEAT, RAID, M4
Applications
ai-generated text detection, academic integrity, llm output detection