
Real, Fake, or Manipulated? Detecting Machine-Influenced Text

Yitong Wang 1, Zhongping Zhang 1, Margherita Piana 1, Zheng Zhou 2, Peter Gerstoft 2, Bryan A. Plummer 1



Published on arXiv: 2509.15350

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

HERO outperforms state-of-the-art by 2.5–3 mAP on average across five LLMs and six domains on fine-grained machine-influenced text detection

HERO

Novel technique introduced


Large Language Models (LLMs) can be used to write or modify documents, presenting a challenge for understanding the intent behind their use. For example, benign uses may involve applying an LLM to a human-written document to improve its grammar or to translate it into another language. However, a document entirely produced by an LLM may be more likely to spread misinformation than a simple translation (e.g., through use by malicious actors or simply through hallucination). Prior work in Machine Generated Text (MGT) detection mostly focuses on identifying whether a document was human- or machine-written, ignoring these fine-grained uses. In this paper, we introduce a HiErarchical, length-RObust machine-influenced text detector (HERO), which learns to separate text samples of varying lengths into four primary types: human-written, machine-generated, machine-polished, and machine-translated. HERO accomplishes this by combining predictions from length-specialist models that have been trained with Subcategory Guidance. Specifically, for categories that are easily confused (e.g., different source languages), our Subcategory Guidance module encourages separation of the fine-grained categories, boosting performance. Extensive experiments across five LLMs and six domains demonstrate the benefits of HERO, which outperforms the state-of-the-art by 2.5–3 mAP on average.
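The hierarchical combination of length-specialist models described above can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the length buckets, hard routing (rather than soft mixing), and the weighted-average fusion are all assumptions.

```python
# Hypothetical sketch of HERO's hierarchical step: pick a length-specialist
# model for a sample, then fuse per-specialist scores over the four
# machine-influence classes. Bucket cutoffs and fusion weights are
# illustrative assumptions, not values from the paper.

CLASSES = ["human-written", "machine-generated",
           "machine-polished", "machine-translated"]

def specialist_for(num_tokens, boundaries=(64, 256)):
    """Pick a length bucket: short, medium, or long (illustrative cutoffs)."""
    if num_tokens < boundaries[0]:
        return "short"
    if num_tokens < boundaries[1]:
        return "medium"
    return "long"

def combine(specialist_scores, weights):
    """Weighted average of per-specialist class probability vectors."""
    total = sum(weights.values())
    return [
        sum(weights[name] * scores[i]
            for name, scores in specialist_scores.items()) / total
        for i in range(len(CLASSES))
    ]

# Toy class probabilities from two specialists for one sample.
scores = {
    "short":  [0.10, 0.60, 0.20, 0.10],
    "medium": [0.20, 0.40, 0.30, 0.10],
}
fused = combine(scores, weights={"short": 1.0, "medium": 1.0})
print(specialist_for(100))               # medium
print(CLASSES[fused.index(max(fused))])  # machine-generated
```

A real system would replace the toy score vectors with softmax outputs from trained transformer classifiers; the fusion step is where the "hierarchical" combination happens.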


Key Contributions

  • Four-class fine-grained classification of machine-influenced text (human, generated, polished, translated) beyond binary human/machine detection
  • Hierarchical architecture combining length-specialist models to handle variable-length inputs
  • Subcategory Guidance module that encourages separation of easily confused fine-grained categories (e.g., different source languages)
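The Subcategory Guidance idea in the last bullet can be pictured as an auxiliary fine-grained loss added to the main four-class objective. The loss form, the λ weighting, and the source-language subcategory head below are assumptions for illustration; the paper's exact formulation may differ.

```python
# Minimal sketch of Subcategory Guidance as an auxiliary loss: alongside the
# coarse four-class cross-entropy, a second cross-entropy over fine-grained
# subcategories (e.g., machine-translated text split by source language)
# encourages separation of easily confused samples. All weights and heads
# here are hypothetical, not the paper's specification.
import math

def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target class."""
    return -math.log(max(probs[target_index], 1e-12))

def guided_loss(main_probs, main_target, sub_probs, sub_target, lam=0.5):
    """Coarse four-class loss plus a weighted subcategory guidance term."""
    return (cross_entropy(main_probs, main_target)
            + lam * cross_entropy(sub_probs, sub_target))

# Coarse head: [human, generated, polished, translated].
main_probs = [0.05, 0.10, 0.15, 0.70]   # predicts "machine-translated"
# Hypothetical subcategory head: source language of the translation.
sub_probs = [0.60, 0.25, 0.15]          # e.g., [zh, de, fr]
loss = guided_loss(main_probs, main_target=3, sub_probs=sub_probs, sub_target=0)
print(round(loss, 3))
```

Minimizing the extra term pushes the model to keep subcategories (such as different source languages) distinct in its representation, which the paper reports boosts the coarse four-class performance.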

🛡️ Threat Analysis

Output Integrity Attack

HERO is a novel AI-generated content detection architecture targeting LLM output provenance — distinguishing human, machine-generated, machine-polished, and machine-translated text. This is a direct output integrity contribution, not a mere domain application of existing detectors.


Details

Domains
nlp
Model Types
transformer, llm
Threat Tags
inference_time
Applications
ai-generated text detection, misinformation detection, document authenticity verification