
HLPD: Aligning LLMs to Human Language Preference for Machine-Revised Text Detection

Fangqi Dai 1,2, Xingjian Jiang 1,3, Zizhuang Deng 1,4

0 citations · 47 references · arXiv


Published on arXiv · 2511.06942

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

HLPD achieves 15.11% relative improvement in AUROC over ImBD and 45.56% over Fast-DetectGPT when detecting GPT-series machine-revised text in black-box settings.
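For clarity on how the headline numbers are computed: a *relative* AUROC improvement is the gain divided by the baseline's AUROC, not a difference in percentage points. A minimal sketch (the AUROC values in the usage comment are hypothetical, not taken from the paper):

```python
def relative_improvement(new_auroc, baseline_auroc):
    """Relative improvement of a new detector over a baseline, in percent."""
    return (new_auroc - baseline_auroc) / baseline_auroc * 100.0

# Hypothetical example: a baseline at 0.60 AUROC and a new detector at 0.69
# would correspond to a relative improvement of roughly 15%.
gain = relative_improvement(0.69, 0.60)
```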

HLPD / HLPO

Novel technique introduced


To prevent misinformation and social harm arising from trustworthy-looking content generated by LLMs, it is crucial to develop efficient and reliable methods for identifying the source of texts. Previous approaches have demonstrated exceptional performance in detecting texts fully generated by LLMs. However, these methods struggle when confronting more advanced LLM output or text subjected to adversarial multi-task machine revision, especially in the black-box setting, where the generating model is unknown. To address this challenge, grounded in the hypothesis that human writing possesses distinctive stylistic patterns, we propose Human Language Preference Detection (HLPD). HLPD employs a reward-based alignment process, Human Language Preference Optimization (HLPO), to shift the scoring model's token distribution toward human-like writing, making the model more sensitive to human writing and thereby enhancing the identification of machine-revised text. We test HLPD in an adversarial multi-task evaluation framework that leverages a five-dimensional prompt generator and multiple advanced LLMs to create diverse revision scenarios. When detecting texts revised by GPT-series models, HLPD achieves a 15.11% relative improvement in AUROC over ImBD, surpassing Fast-DetectGPT by 45.56%. When evaluated on texts generated by advanced LLMs, HLPD achieves the highest average AUROC, exceeding ImBD by 5.53% and Fast-DetectGPT by 34.14%. Code will be made available at https://github.com/dfq2021/HLPD.


Key Contributions

  • Human Language Preference Optimization (HLPO): a reward-based alignment process that shifts a scoring model's token distribution toward human-like writing to improve machine-revised text detection
  • Adversarial multi-task evaluation framework using a five-dimensional prompt generator and multiple advanced LLMs to create diverse revision scenarios (polish, expand, rewrite)
  • HLPD achieves 15.11% relative AUROC improvement over ImBD and 45.56% over Fast-DetectGPT on GPT-series-revised texts in black-box settings
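Since the contributions are reported via AUROC, it may help to see the ranking-based evaluation step concretely. A minimal sketch, assuming the generic likelihood-based setup that detectors such as Fast-DetectGPT build on: each text receives a scalar score from the scoring model, and AUROC measures how well those scores separate human texts from machine-revised ones (this is not the paper's exact scoring statistic):

```python
# Pairwise AUROC: the probability that a randomly chosen human text receives
# a higher detection score than a randomly chosen machine-revised text,
# counting ties as 0.5. Scores here are abstract floats from any detector.
def auroc(human_scores, machine_scores):
    wins = sum((h > m) + 0.5 * (h == m)
               for h in human_scores for m in machine_scores)
    return wins / (len(human_scores) * len(machine_scores))
```

An AUROC of 1.0 means perfect separation; 0.5 means the detector is no better than chance, which is why the reported gains over ImBD and Fast-DetectGPT are expressed relative to this scale.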

🛡️ Threat Analysis

Output Integrity Attack

The primary contribution is a novel AI-generated/machine-revised text detection architecture (HLPD/HLPO). The paper proposes a fundamentally new detection methodology, reward-based human language preference optimization, to distinguish human-written from machine-revised text, directly addressing output integrity and content provenance.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time
Datasets
GPT-3.5 revised texts, GPT-4o revised texts, adversarial multi-task revision benchmark
Applications
AI-generated text detection, machine-revised text detection, content authorship verification