defense 2025

Trace Is In Sentences: Unbiased Lightweight ChatGPT-Generated Text Detector

Mo Mu, Dianqiao Lei, Chang Li

0 citations · 33 references · arXiv

Published on arXiv

2509.18535

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Sentence-level structural detection outperforms word-level baselines on both original and paraphrase/simple-prompt-modified ChatGPT text across multilingual multi-domain benchmarks.


The widespread adoption of ChatGPT has raised concerns about its misuse, highlighting the need for robust detection of AI-generated text. Current word-level detectors are vulnerable to paraphrasing or simple prompts (PSP), suffer from biases induced by ChatGPT's word-level patterns (CWP) and training data content, degrade on modified text, and often require large models or online LLM interaction. To tackle these issues, we introduce a novel task to detect both original and PSP-modified AI-generated texts, and propose a lightweight framework that classifies texts based on their internal structure, which remains invariant under word-level changes. Our approach encodes sentence embeddings from pre-trained language models and models their relationships via attention. We employ contrastive learning to mitigate embedding biases from autoregressive generation and incorporate a causal graph with counterfactual methods to isolate structural features from topic-related biases. Experiments on two curated datasets, including abstract comparisons and revised life FAQs, validate the effectiveness of our method.
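
The idea of encoding sentence embeddings and modelling their relationships via attention can be illustrated with a minimal sketch. It is not the authors' implementation: the toy 3-dimensional vectors stand in for embeddings from a pre-trained language model, the attention has no learned projections, and the names `self_attention`, `mean_pool`, `score`, `w`, and `b` are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(sentences):
    # Scaled dot-product self-attention over sentence vectors.
    # Assumption: no learned query/key/value projections, for brevity.
    d = len(sentences[0])
    out = []
    for q in sentences:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in sentences])
        out.append([sum(w * v[i] for w, v in zip(weights, sentences))
                    for i in range(d)])
    return out

def mean_pool(vectors):
    # Average the attended sentence vectors into one document vector.
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def score(sentences, w, b):
    # Logistic layer on the pooled representation; w and b are placeholders
    # for parameters that would normally be learned.
    pooled = mean_pool(self_attention(sentences))
    return 1.0 / (1.0 + math.exp(-(dot(pooled, w) + b)))

# Toy "sentence embeddings" for a three-sentence document (made-up values).
doc = [[1.0, 0.0, 0.5], [0.9, 0.1, 0.4], [0.0, 1.0, 0.2]]
w, b = [0.5, -0.5, 0.1], 0.0
p = score(doc, w, b)
```

Because the classifier only sees one vector per sentence and the attention weights between them, its input describes how sentences relate to each other rather than which individual words appear.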


Key Contributions

  • Identifies and formalizes the bias that ChatGPT's word-level patterns (CWP) induce in ChatGPT text detectors via causal graph analysis, explaining their vulnerability to paraphrase/simple-prompt (PSP) attacks
  • Proposes a lightweight detection framework encoding inter-sentence structural relations with contrastive learning and counterfactual causal methods to decouple structure from word-level and topic biases
  • Constructs and releases a large-scale multilingual benchmark (263,595 English + 76,503 Chinese samples) with PSP variants including cyclic translation, synonym substitution, and diverse prompts
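
As an illustration of one PSP variant named above, synonym substitution, the following sketch rewrites individual words while leaving the sentence count unchanged. The synonym table and the helpers `substitute_synonyms` and `count_sentences` are hypothetical and are not taken from the paper's data pipeline.

```python
import re

# Hypothetical synonym table, for illustration only.
SYNONYMS = {"use": "employ", "show": "demonstrate", "new": "novel", "big": "large"}

def substitute_synonyms(text, table=SYNONYMS):
    # Replace each word that has an entry in the table, preserving
    # a leading capital letter.
    def swap(match):
        word = match.group(0)
        repl = table.get(word.lower())
        if repl is None:
            return word
        return repl.capitalize() if word[0].isupper() else repl
    return re.sub(r"[A-Za-z]+", swap, text)

def count_sentences(text):
    # Naive split on whitespace that follows sentence-ending punctuation.
    return len([s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s])

original = "We use a new model. Results show big gains."
modified = substitute_synonyms(original)
print(modified)  # We employ a novel model. Results demonstrate large gains.
```

The word sequence changes, which is what a word-level detector keys on, while the number and order of sentences stay the same.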

🛡️ Threat Analysis

Output Integrity Attack

Proposes a novel AI-generated text detection method that classifies whether text was produced by ChatGPT, which falls squarely under output integrity and content authenticity. The paper contributes a new detection architecture (inter-sentence attention, contrastive learning, causal counterfactual debiasing) rather than applying existing methods to a new domain.


Details

Domains
nlp
Model Types
transformer, llm
Threat Tags
inference_time, black_box
Datasets
HC3, Arxiv abstracts (custom), life FAQ dataset (custom)
Applications
ai-generated text detection, academic integrity, misinformation detection