How Sampling Affects the Detectability of Machine-written texts: A Comprehensive Study
Matthieu Dubois 1, François Yvon 1, Pablo Piantanida 2,3,4
Published on arXiv
2510.13681
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Minor adjustments to LLM decoding parameters (temperature, top-p/nucleus sampling) can reduce state-of-the-art AI text detector AUROC from 0.99 to as low as 0.01, revealing severe robustness failures in current detection systems.
As texts generated by Large Language Models (LLMs) are ever more common and often indistinguishable from human-written content, research on automatic text detection has attracted growing attention. Many recent detectors report near-perfect accuracy, often boasting AUROC scores above 99%. However, these claims typically assume fixed generation settings, leaving open the question of how robust such systems are to changes in decoding strategies. In this work, we systematically examine how sampling-based decoding impacts detectability, with a focus on how subtle variations in a model's (sub)word-level distribution affect detection performance. We find that even minor adjustments to decoding parameters - such as temperature, top-p, or nucleus sampling - can severely impair detector accuracy, with AUROC dropping from near-perfect levels to 1% in some settings. Our findings expose critical blind spots in current detection methods and emphasize the need for more comprehensive evaluation protocols. To facilitate future research, we release a large-scale dataset encompassing 37 decoding configurations, along with our code and evaluation framework at https://github.com/BaggerOfWords/Sampling-and-Detection
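The decoding parameters the paper varies act directly on the model's next-token distribution. The following is a minimal, self-contained sketch (pure Python, toy logits rather than a real model) of how temperature rescaling and top-p (nucleus) filtering reshape that distribution; all values are hypothetical and only illustrate the mechanism.

```python
import math

def apply_temperature(logits, temperature):
    """Rescale logits by temperature, then softmax into probabilities.
    Lower temperature sharpens the distribution; higher flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def top_p_filter(probs, p):
    """Nucleus sampling: keep the smallest set of highest-probability
    tokens whose cumulative mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    mass = sum(probs[i] for i in kept)
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / mass
    return out

# Hypothetical next-token logits for a 5-token vocabulary.
logits = [2.0, 1.0, 0.5, 0.1, -1.0]
sharp = apply_temperature(logits, 0.5)    # low temperature: peakier
flat = apply_temperature(logits, 1.5)     # high temperature: flatter
nucleus = top_p_filter(apply_temperature(logits, 1.0), 0.9)  # tail truncated
```

Even these small parameter changes move probability mass between head and tail tokens, which is exactly the kind of distributional shift the paper links to detector failure.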
Key Contributions
- Large-scale benchmark dataset of LLM-generated texts spanning 37 decoding configurations (temperature, top-p, nucleus sampling, etc.) across six decoding strategies
- Systematic evaluation showing state-of-the-art AI text detectors are critically sensitive to sampling parameters, with AUROC dropping from 0.99 to 0.01
- In-depth analysis of the mechanisms linking token-level distribution changes to detection success and failure, exposing blind spots in current evaluation protocols
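The AUROC numbers above have a simple probabilistic reading: AUROC is the probability that a randomly chosen machine-written text receives a higher detector score than a randomly chosen human-written one. A score near 0.01 therefore means the detector ranks machine text *below* human text, i.e., worse than chance. A minimal sketch (pure Python, with hypothetical detector scores) of computing AUROC via the Mann-Whitney U statistic:

```python
def auroc(pos_scores, neg_scores):
    """AUROC as the probability that a random positive (machine-written)
    score outranks a random negative (human-written) score; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical detector scores illustrating the two regimes in the paper.
human = [0.10, 0.20, 0.25, 0.30]
machine_default = [0.80, 0.85, 0.90, 0.95]   # cleanly separable
machine_shifted = [0.02, 0.05, 0.08, 0.09]   # decoding change pushed scores
                                             # below the human range
```

With these toy scores, `auroc(machine_default, human)` is 1.0 while `auroc(machine_shifted, human)` is 0.0, mirroring the near-perfect-to-inverted collapse the paper reports.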
🛡️ Threat Analysis
The paper is squarely about AI-generated text detection (output integrity/authenticity): it evaluates the robustness of existing AI text detectors when the generator varies its sampling strategy. Showing that detectors fail catastrophically under natural decoding variation is a core finding about the reliability of systems in scope for ML09.