Defense · 2025

Not all tokens are created equal: Perplexity Attention Weighted Networks for AI generated text detection

Pablo Miralles-González, Javier Huertas-Tato, Alejandro Martín, David Camacho

Published on arXiv: 2501.03940

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

LLaMA3-1B-based PAWN achieves 81.46% mean macro-averaged F1 in nine-language cross-validation, outperforming fine-tuned LM baselines with a fraction of trainable parameters and better distribution-shift robustness.

PAWN (Perplexity Attention Weighted Network)

Novel technique introduced


The rapid advancement of large language models (LLMs) has significantly enhanced their ability to generate coherent and contextually relevant text, raising concerns about the misuse of AI-generated content and making its detection critical. However, the task remains challenging, particularly in unseen domains or with unfamiliar LLMs. Leveraging LLM next-token distribution outputs offers a theoretically appealing approach for detection, as these outputs encapsulate insights from the models' extensive pre-training on diverse corpora. Despite this promise, zero-shot methods that attempt to operationalize these outputs have met with limited success. We hypothesize that one problem is their use of the mean to aggregate next-token distribution metrics across tokens, when some tokens are naturally easier or harder to predict and should be weighted differently. Based on this idea, we propose the Perplexity Attention Weighted Network (PAWN), which uses the last hidden states of the LLM and token positions to weight a sum of features derived from next-token distribution metrics across the sequence. Although not zero-shot, our method allows us to cache the last hidden states and next-token distribution metrics on disk, greatly reducing training resource requirements. PAWN shows competitive and even better in-distribution performance than the strongest baselines (fine-tuned LMs) with a fraction of their trainable parameters. Our model also generalizes better to unseen domains and source models, with smaller variability in the decision boundary across distribution shifts. It is more robust to adversarial attacks, and if the backbone has multilingual capabilities, it generalizes decently to languages not seen during supervised training, with LLaMA3-1B reaching a mean macro-averaged F1 score of 81.46% in cross-validation with nine languages.


Key Contributions

  • PAWN: a lightweight classifier that uses LLM hidden states and positional encodings to attention-weight perplexity-based token-level features, replacing naive mean aggregation across sequence positions
  • Demonstrates that cached hidden states and next-token distribution metrics drastically reduce training compute compared to fine-tuned LM baselines while matching or exceeding in-distribution performance
  • Shows stronger cross-domain and cross-model generalization, adversarial robustness, and multilingual transfer (81.46% mean macro-F1 across nine languages with a LLaMA3-1B backbone)
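The core idea above can be sketched in a few lines: instead of averaging per-token perplexity-style features uniformly, attention weights computed from the LLM's hidden states and token positions decide how much each token contributes. The snippet below is a minimal NumPy illustration, not the paper's implementation: the hidden states, negative log-likelihoods, and the learned weight vector `w_score` are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for what a frozen LLM backbone would provide per token
# (in the paper these are cached on disk to cut training cost):
T, H = 6, 8                        # sequence length, hidden size
hidden = rng.normal(size=(T, H))   # last hidden states, one row per token
nll = rng.uniform(1.0, 6.0, T)     # per-token negative log-likelihood feature

# Hypothetical learned parameter (random here for illustration):
w_score = rng.normal(size=H)       # maps a hidden state to an attention logit
pos = np.arange(T) / T             # simple positional feature

# Attention weights over tokens, conditioned on hidden state and position
logits = hidden @ w_score + pos
weights = np.exp(logits - logits.max())
weights /= weights.sum()

# Naive mean aggregation (typical of zero-shot detectors)
mean_feature = nll.mean()
# PAWN-style weighted sum: easy-to-predict tokens can be down-weighted
weighted_feature = float(weights @ nll)
```

The weighted feature would then feed a small classifier head; only the weighting and classification parameters are trained, which is why the approach needs a fraction of the trainable parameters of a fine-tuned LM.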

🛡️ Threat Analysis

Output Integrity Attack

PAWN is a novel AI-generated text detection method, directly addressing output integrity and content provenance by distinguishing human-written from LLM-generated text. The paper proposes a new detection architecture rather than merely applying existing methods to a domain, which qualifies it as an ML09 contribution.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time
Applications
ai-generated text detection, content authenticity verification