
Human Texts Are Outliers: Detecting LLM-generated Texts via Out-of-distribution Detection

Cong Zeng 1, Shengkun Tang 2, Yuanzhou Chen 1, Zhiqiang Shen 2, Wenchao Yu 3, Xujiang Zhao 2, Haifeng Chen 1, Wei Cheng 2, Zhiqiang Xu 1

1 citation · 82 references · arXiv

Published on arXiv · 2510.08602

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

OOD-based detection achieves 98.3% AUROC and AUPR with only 8.9% FPR95 on the DeepFake dataset, outperforming binary classifiers across multilingual and adversarial robustness settings.

OOD-based LLM text detection (DeepSVDD / HRN / Energy-based)

Novel technique introduced


The rapid advancement of large language models (LLMs) such as ChatGPT, DeepSeek, and Claude has significantly increased the presence of AI-generated text in digital communication. This trend has heightened the need for reliable detection methods to distinguish between human-authored and machine-generated content. Existing approaches, both zero-shot methods and supervised classifiers, largely conceptualize this task as a binary classification problem, often leading to poor generalization across domains and models. In this paper, we argue that such a binary formulation fundamentally mischaracterizes the detection task by assuming a coherent representation of human-written texts. In reality, human texts do not constitute a unified distribution, and their diversity cannot be effectively captured through limited sampling. This causes previous classifiers to memorize observed OOD characteristics rather than learn the essence of 'non-ID' behavior, limiting generalization to unseen human-authored inputs. Based on this observation, we propose reframing the detection task as an out-of-distribution (OOD) detection problem, treating human-written texts as distributional outliers while machine-generated texts are in-distribution (ID) samples. To this end, we develop a detection framework using one-class learning methods, including DeepSVDD and HRN, and score-based learning techniques such as the energy-based method, enabling robust and generalizable performance. Extensive experiments across multiple datasets validate the effectiveness of our OOD-based approach. Specifically, the OOD-based method achieves 98.3% AUROC and AUPR with only 8.9% FPR95 on the DeepFake dataset. Moreover, we test our detection framework on multilingual, attacked, and unseen-model and unseen-domain text settings, demonstrating the robustness and generalizability of our framework. Code, pretrained weights, and a demo will be released.
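The one-class (DeepSVDD-style) framing described in the abstract can be illustrated with a minimal sketch: fit a hypersphere center to embeddings of machine-generated (ID) texts, then score new texts by their distance to that center, with large distances flagging outliers (human-written texts). The embedding inputs and function names below are illustrative assumptions, not the paper's implementation:

```python
def svdd_center(id_embeddings):
    """Center c of the hypersphere: mean of in-distribution (machine-text) embeddings."""
    dim = len(id_embeddings[0])
    n = len(id_embeddings)
    return [sum(e[i] for e in id_embeddings) / n for i in range(dim)]

def svdd_score(embedding, center):
    """Squared distance to the center; large score -> outlier (human-written)."""
    return sum((x - c) ** 2 for x, c in zip(embedding, center))

# Toy usage: two ID embeddings define the center; a far-away point scores high.
center = svdd_center([[0.0, 0.0], [2.0, 2.0]])   # -> [1.0, 1.0]
print(svdd_score([1.0, 1.0], center))            # at the center: 0.0
print(svdd_score([3.0, 3.0], center))            # far from center: 8.0
```

In DeepSVDD proper, the embedding network itself is also trained to pull ID samples toward the center; this sketch shows only the scoring geometry.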


Key Contributions

  • Reframes LLM-generated text detection as an out-of-distribution detection problem, treating human texts as distributional outliers rather than a coherent in-distribution class
  • Develops a detection framework combining one-class learning methods (DeepSVDD, HRN) and energy-based scoring to enable robust, generalizable detection
  • Demonstrates strong generalization across multilingual, adversarially attacked, and unseen-model/unseen-domain text settings, achieving 98.3% AUROC on the DeepFake dataset
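The energy-based scoring named in the contributions is conventionally computed from a model's logits as E(x) = -T · logsumexp(logits / T); the sketch below shows that standard score (the temperature default and threshold usage are assumptions, not values from the paper):

```python
import math

def energy_score(logits, temperature=1.0):
    """Energy-based OOD score: E(x) = -T * logsumexp(logits / T).
    Lower (more negative) energy -> more in-distribution (machine-generated
    in this paper's framing); higher energy -> outlier (human-written)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    return -temperature * (m + math.log(sum(math.exp(s - m) for s in scaled)))

# A confidently peaked logit vector yields lower energy than a flat one,
# so thresholding the score separates ID from OOD inputs.
print(energy_score([10.0, 0.0]))  # ~ -10.0 (in-distribution-like)
print(energy_score([0.0, 0.0]))   # ~ -0.69 (less in-distribution)
```

In practice a threshold on this score is chosen on held-out data (e.g. to fix the false-positive rate at 95% true-positive rate, the FPR95 metric reported above).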

🛡️ Threat Analysis

Output Integrity Attack

Directly addresses AI-generated content detection — distinguishing LLM-generated text from human-authored text. The paper's primary contribution is a novel detection methodology (OOD framing with DeepSVDD, HRN, and energy-based scoring) for verifying the authenticity and provenance of text content, which is a core ML09 concern.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, black_box
Datasets
DeepFake dataset
Applications
ai-generated text detection, content authenticity verification