When Personalization Tricks Detectors: The Feature-Inversion Trap in Machine-Generated Text Detection
Lang Gao 1, Xuhui Li 1, Chenxi Wang 1, Mingzhe Li 2, Wei Liu 3, Zirui Song 1, Jinghui Zhang 1, Rui Yan 4, Preslav Nakov 1, Xiuying Chen 1
Published on arXiv: 2510.12476
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
The proposed diagnostic framework predicts detector performance gaps in personalized settings, achieving 85% correlation with the empirically measured gaps and exposing widespread vulnerability in state-of-the-art MGT detectors.
Large language models (LLMs) have grown more powerful at language generation, producing fluent text and even imitating personal style. This ability, however, also heightens the risk of identity impersonation. To the best of our knowledge, no prior work has examined personalized machine-generated text (MGT) detection. In this paper, we introduce the first benchmark for evaluating detector robustness in personalized settings, built from literary and blog texts paired with their LLM-generated imitations. Our experiments reveal large performance gaps across detectors in personalized settings: some state-of-the-art models suffer significant drops. We attribute this limitation to the *feature-inversion trap*, in which features that are discriminative in general domains become inverted, and therefore misleading, when applied to personalized text. Building on this finding, we propose a simple and reliable method for predicting detector performance changes in personalized settings. The method identifies latent directions corresponding to inverted features and constructs probe datasets that differ primarily along those features in order to evaluate detector dependence on them. Our experiments show that the method accurately predicts both the direction and the magnitude of post-transfer performance changes, achieving 85% correlation with the actual performance gaps. We hope this work encourages further research on personalized text detection.
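To make the feature-inversion trap concrete, here is a minimal, purely illustrative sketch (all numbers and feature names are hypothetical, not from the paper): a detector that thresholds on a single stylometric feature can be perfectly right in the general domain and perfectly wrong on personalized imitations, because the feature's relationship to the label flips.

```python
# Illustrative sketch of the feature-inversion trap (hypothetical numbers).
# Suppose a detector thresholds a single stylometric feature, e.g. lexical
# diversity, which in general domains tends to be higher for human text.

def threshold_detector(score, threshold=0.5):
    """Label text as human-written (1) if the feature score exceeds the threshold."""
    return 1 if score > threshold else 0

# General domain: human text (label 1) scores high, machine text (label 0) low.
general = [(0.8, 1), (0.7, 1), (0.3, 0), (0.2, 0)]

# Personalized domain: an LLM imitating a distinctive author can score
# *higher* on the same feature than the author's own text -- the feature inverts.
personalized = [(0.4, 1), (0.35, 1), (0.7, 0), (0.65, 0)]

def accuracy(samples):
    return sum(threshold_detector(s) == label for s, label in samples) / len(samples)

print(accuracy(general))       # perfectly discriminative in the general domain
print(accuracy(personalized))  # the same rule becomes systematically wrong
```

The point of the toy example is that the detector's error is not random noise: an inverted feature makes it confidently wrong in a consistent direction, which is exactly what a probe-based diagnostic can detect ahead of time.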
Key Contributions
- First benchmark dataset for personalized MGT detection, pairing authentic literary and blog texts with their LLM-generated stylistic imitations
- Identification and characterization of the feature-inversion trap, where features discriminative in general domains flip direction in personalized settings and mislead detectors
- A probe-based diagnostic framework that predicts the direction and magnitude of detector performance changes under personalization, achieving 85% correlation with empirical results
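The probe-based diagnostic described above can be sketched in three steps: estimate a latent direction that separates the classes in the general domain, build a probe set whose examples vary primarily along that direction, and measure how strongly the detector's score tracks it. The sketch below is an assumption-laden stand-in, not the paper's implementation: the embeddings, the mean-difference direction estimate, and `toy_detector` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: feature vectors for human vs. machine text in a
# general domain, where only dimension 0 separates the two classes.
d = 16
human = rng.normal(0.0, 1.0, size=(200, d))
machine = rng.normal(0.0, 1.0, size=(200, d))
machine[:, 0] += 2.0  # dimension 0 is discriminative in the general domain

# Step 1: estimate the latent direction separating the classes
# (difference of class means, a simple linear-probe stand-in).
direction = machine.mean(axis=0) - human.mean(axis=0)
direction /= np.linalg.norm(direction)

# Step 2: construct a probe set that differs primarily along this direction.
base = rng.normal(0.0, 0.1, size=(100, d))   # small off-direction variation
shifts = rng.normal(0.0, 1.0, size=100)       # large on-direction variation
probes = base + np.outer(shifts, direction)

# Step 3: quantify detector dependence on the probed direction.
def toy_detector(x):
    # Hypothetical detector that (over-)relies on dimension 0.
    return x[:, 0]

scores = toy_detector(probes)
projections = probes @ direction
dependence = np.corrcoef(scores, projections)[0, 1]
print(f"detector dependence on probed direction: {dependence:.2f}")
```

A detector with high dependence on a direction that inverts under personalization is predicted to degrade there; a detector with low dependence should transfer more gracefully. This is the intuition behind predicting both the direction and the magnitude of the post-transfer change.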
🛡️ Threat Analysis
Directly addresses AI-generated content detection: the paper introduces a benchmark for evaluating MGT detectors, analyzes why they fail (the feature-inversion trap), and proposes a method for predicting detector robustness, all of which falls squarely within content authenticity and output integrity.