How Sampling Affects the Detectability of Machine-written texts: A Comprehensive Study
Matthieu Dubois 1, François Yvon 1, Pablo Piantanida 2,3,4
Published on arXiv
2510.13681
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Minor adjustments to LLM decoding parameters (temperature, top-p/nucleus sampling) can reduce state-of-the-art AI text detector AUROC from 0.99 to as low as 0.01, revealing severe robustness failures in current detection systems.
As texts generated by Large Language Models (LLMs) are ever more common and often indistinguishable from human-written content, research on automatic text detection has attracted growing attention. Many recent detectors report near-perfect accuracy, often boasting AUROC scores above 99%. However, these claims typically assume fixed generation settings, leaving open the question of how robust such systems are to changes in decoding strategies. In this work, we systematically examine how sampling-based decoding impacts detectability, with a focus on how subtle variations in a model's (sub)word-level distribution affect detection performance. We find that even minor adjustments to decoding parameters - such as temperature, top-p, or nucleus sampling - can severely impair detector accuracy, with AUROC dropping from near-perfect levels to 1% in some settings. Our findings expose critical blind spots in current detection methods and emphasize the need for more comprehensive evaluation protocols. To facilitate future research, we release a large-scale dataset encompassing 37 decoding configurations, along with our code and evaluation framework at https://github.com/BaggerOfWords/Sampling-and-Detection
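The decoding parameters the paper varies act directly on the model's next-token distribution. The following is a minimal, self-contained sketch (pure Python, toy logits rather than a real model) of how temperature rescaling and top-p (nucleus) filtering reshape that distribution; all values are hypothetical and only illustrate the mechanism.

```python
import math

def apply_temperature(logits, temperature):
    """Rescale logits by temperature, then softmax into probabilities.
    Lower temperature sharpens the distribution; higher flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def top_p_filter(probs, p):
    """Nucleus sampling: keep the smallest set of highest-probability
    tokens whose cumulative mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    mass = sum(probs[i] for i in kept)
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / mass
    return out

# Hypothetical next-token logits for a 5-token vocabulary.
logits = [2.0, 1.0, 0.5, 0.1, -1.0]
sharp = apply_temperature(logits, 0.5)    # low temperature: peakier
flat = apply_temperature(logits, 1.5)     # high temperature: flatter
nucleus = top_p_filter(apply_temperature(logits, 1.0), 0.9)  # tail truncated
```

Even these small parameter changes move probability mass between head and tail tokens, which is exactly the kind of distributional shift the paper links to detector failure.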
Key Contributions
- Large-scale benchmark dataset of LLM-generated texts spanning 37 decoding configurations (temperature, top-p, nucleus sampling, etc.) across six decoding strategies
- Systematic evaluation showing state-of-the-art AI text detectors are critically sensitive to sampling parameters, with AUROC dropping from 0.99 to 0.01
- In-depth analysis of the mechanisms linking token-level distribution changes to detection success and failure, exposing blind spots in current evaluation protocols
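The AUROC numbers above have a simple probabilistic reading: AUROC is the probability that a randomly chosen machine-written text receives a higher detector score than a randomly chosen human-written one. A score near 0.01 therefore means the detector ranks machine text *below* human text, i.e., worse than chance. A minimal sketch (pure Python, with hypothetical detector scores) of computing AUROC via the Mann-Whitney U statistic:

```python
def auroc(pos_scores, neg_scores):
    """AUROC as the probability that a random positive (machine-written)
    score outranks a random negative (human-written) score; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical detector scores illustrating the two regimes in the paper.
human = [0.10, 0.20, 0.25, 0.30]
machine_default = [0.80, 0.85, 0.90, 0.95]   # cleanly separable
machine_shifted = [0.02, 0.05, 0.08, 0.09]   # decoding change pushed scores
                                             # below the human range
```

With these toy scores, `auroc(machine_default, human)` is 1.0 while `auroc(machine_shifted, human)` is 0.0, mirroring the near-perfect-to-inverted collapse the paper reports.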
🛡️ Threat Analysis
The paper is squarely about AI-generated text detection (output integrity/authenticity): it evaluates the robustness of existing AI text detectors when the generator varies its sampling strategy. Showing that detectors fail catastrophically under natural decoding variation is a core finding about the reliability of systems in scope for ML09.