Defense · 2025

DetectAnyLLM: Towards Generalizable and Robust Detection of Machine-Generated Text Across Domains and Models

Jiachen Fu 1, Chun-Le Guo 1, Chongyi Li 1,2



Published on arXiv: 2509.14268

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

DetectAnyLLM achieves over 70% performance improvement over existing detectors under the same training data and base scoring model on the MIRAGE benchmark.

Direct Discrepancy Learning (DDL) / DetectAnyLLM

Novel technique introduced


The rapid advancement of large language models (LLMs) has drawn urgent attention to the task of machine-generated text detection (MGTD). However, existing approaches struggle in complex real-world scenarios: zero-shot detectors rely heavily on the scoring model's output distribution, while training-based detectors are often constrained by overfitting to the training data, limiting generalization. We found that the performance bottleneck of training-based detectors stems from the misalignment between the training objective and the needs of the task. To address this, we propose Direct Discrepancy Learning (DDL), a novel optimization strategy that directly optimizes the detector with task-oriented knowledge. DDL enables the detector to better capture the core semantics of the detection task, thereby enhancing both robustness and generalization. Built upon this, we introduce DetectAnyLLM, a unified detection framework that achieves state-of-the-art MGTD performance across diverse LLMs. To ensure a reliable evaluation, we construct MIRAGE, the most diverse multi-task MGTD benchmark. MIRAGE samples human-written texts from 10 corpora across 5 text domains, which are then re-generated or revised using 17 cutting-edge LLMs, covering a wide spectrum of proprietary models and textual styles. Extensive experiments on MIRAGE reveal the limitations of existing methods in complex environments. In contrast, DetectAnyLLM consistently outperforms them, achieving over a 70% performance improvement under the same training data and base scoring model, underscoring the effectiveness of our DDL. Project page: https://fjc2005.github.io/detectanyllm
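The abstract contrasts zero-shot detectors, which score a text against a model's output distribution, with training-based ones. A minimal sketch of the perturbation-discrepancy scoring that this line of zero-shot MGTD work builds on (the function names, toy stand-ins, and threshold below are illustrative assumptions, not the paper's DDL):

```python
from statistics import mean

def discrepancy_score(text, log_likelihood, perturb, n_perturbations=10):
    """Score = log-likelihood of the original text minus the mean
    log-likelihood of perturbed variants. Machine-generated text tends
    to sit near a local likelihood maximum, so its score is higher."""
    original = log_likelihood(text)
    perturbed = mean(log_likelihood(perturb(text)) for _ in range(n_perturbations))
    return original - perturbed

def is_machine_generated(text, log_likelihood, perturb, threshold=0.05):
    """Threshold the discrepancy score; the threshold is a free parameter."""
    return discrepancy_score(text, log_likelihood, perturb) > threshold

# Toy stand-ins so the sketch runs without a real LLM: a "scoring model"
# whose log-likelihood simply penalizes word count, and a "perturbation"
# that appends a word (so perturbed variants always score lower).
def toy_log_likelihood(text):
    return -0.1 * len(text.split())

def toy_perturb(text):
    return text + " extra"
```

In a real detector, `log_likelihood` would come from a scoring LLM and `perturb` from a mask-filling model; the point of the sketch is only the shape of the discrepancy statistic.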


Key Contributions

  • Direct Discrepancy Learning (DDL): a novel training objective that aligns detector optimization directly with MGTD task semantics, improving generalization and robustness over standard supervised training
  • DetectAnyLLM: a unified detection framework achieving state-of-the-art performance across diverse LLMs, domains, and writing styles
  • MIRAGE benchmark: the most diverse MGTD evaluation benchmark, covering 10 corpora across 5 domains re-generated by 17 cutting-edge LLMs, enabling rigorous real-world evaluation
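The DDL contribution above is described as aligning the training objective directly with the detection task. The paper's exact loss is not given in this card, so the following is only a generic margin-ranking sketch of that idea: push discrepancy scores of machine-generated samples above those of human-written ones, rather than fitting surface features of the training data.

```python
def ddl_style_loss(machine_scores, human_scores, margin=1.0):
    """Illustrative pairwise margin loss over discrepancy scores.
    NOT the paper's actual DDL formulation: it only demonstrates
    optimizing the score gap between machine and human text directly."""
    total = 0.0
    for m in machine_scores:
        for h in human_scores:
            # Hinge: zero loss once the machine score beats the
            # human score by at least `margin`.
            total += max(0.0, margin - (m - h))
    return total / (len(machine_scores) * len(human_scores))
```

A detector trained against such an objective is optimized for the ranking the task actually needs (machine above human), which is the kind of task alignment the contribution describes.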

🛡️ Threat Analysis

Output Integrity Attack

Core contribution is detecting AI/LLM-generated text (output integrity/content authenticity) via a novel detection framework (DetectAnyLLM) and optimization strategy (DDL), directly addressing the challenge of verifying whether text outputs are human- or machine-generated.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time
Datasets
MIRAGE
Applications
machine-generated text detection, ai text attribution, academic integrity