Modeling the Attack: Detecting AI-Generated Text by Quantifying Adversarial Perturbations
Lekkala Sai Teja 1, Annepaka Yadagiri 1, Sangam Sai Anish 1, Siva Gopala Krishna Nuthakki 2, Partha Pakray 1
Published on arXiv: arXiv:2510.02319
Output Integrity Attack
OWASP ML Top 10 — ML09
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
PIFE achieves 82.6% TPR at 1% FPR against semantic adversarial attacks (paraphrasing), compared to 48.8% for conventional adversarial training, demonstrating that explicitly modeling perturbation artifacts outperforms training on them.
PIFE (Perturbation-Invariant Feature Engineering)
Novel technique introduced
The rapid advancement of Large Language Models (LLMs) poses a significant dual-use problem, making reliable AI-generated text detection systems a necessity. Modern detectors are notoriously vulnerable to adversarial attacks, with paraphrasing standing out as an effective evasion technique that foils statistical detection. This paper presents a comparative study of adversarial robustness: we first quantify the limitations of standard adversarial training, then introduce a novel, significantly more resilient detection framework, Perturbation-Invariant Feature Engineering (PIFE). PIFE first transforms input text into a standardized form using a multi-stage normalization pipeline, then quantifies the transformation's magnitude using metrics such as Levenshtein distance and semantic similarity, feeding these signals directly to the classifier. We evaluate both a conventionally hardened Transformer and our PIFE-augmented model against a hierarchical taxonomy of character-, word-, and sentence-level attacks. Our findings first confirm that conventional adversarial training, while resilient to syntactic noise, fails against semantic attacks, an effect we term the "semantic evasion threshold": its True Positive Rate at a strict 1% False Positive Rate plummets to 48.8%. In stark contrast, our PIFE model, which explicitly engineers features from the discrepancy between a text and its canonical form, overcomes this limitation. It maintains a remarkable 82.6% TPR under the same conditions, effectively neutralizing the most sophisticated semantic attacks. This superior performance demonstrates that explicitly modeling perturbation artifacts, rather than merely training on them, is a more promising path toward achieving genuine robustness in the adversarial arms race.
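The core PIFE idea (normalize the input, then feed the *size of the discrepancy* to the classifier) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the normalization stages shown here (Unicode NFKC, format-character stripping, whitespace collapsing, lowercasing) and the token-overlap stand-in for semantic similarity are assumptions; the actual pipeline and metrics are described only at a high level in the paper.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Toy multi-stage normalization to a canonical form.

    Illustrative stand-in for the paper's pipeline: Unicode
    canonicalization, zero-width/format-character removal,
    whitespace collapsing, case folding.
    """
    t = unicodedata.normalize("NFKC", text)
    t = "".join(c for c in t if unicodedata.category(c) != "Cf")
    t = re.sub(r"\s+", " ", t).strip()
    return t.lower()


def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def perturbation_features(text: str) -> dict:
    """Discrepancy between a text and its canonical form, as
    explicit classifier features (the PIFE signal)."""
    canon = normalize(text)
    dist = levenshtein(text, canon)
    # Token-overlap (Jaccard) as a cheap stand-in for the
    # semantic-similarity metric the paper mentions.
    a, b = set(text.lower().split()), set(canon.split())
    sim = len(a & b) / len(a | b) if a | b else 1.0
    return {
        "edit_distance": dist,
        "edit_ratio": dist / max(len(text), 1),
        "similarity": sim,
    }
```

A clean text yields near-zero discrepancy features, while character-level tricks (homoglyphs, zero-width characters, spacing games) produce a large edit distance to the canonical form, which is exactly the artifact the classifier is meant to see.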
Key Contributions
- Introduces PIFE (Perturbation-Invariant Feature Engineering), which normalizes text to a canonical form via a multi-stage pipeline and uses discrepancy metrics (Levenshtein distance, semantic similarity) as explicit classifier features to expose adversarial artifacts
- Identifies and quantifies the "semantic evasion threshold": conventional adversarial training achieves only 48.8% TPR at 1% FPR against semantic attacks, while PIFE achieves 82.6% TPR under the same conditions
- Provides a hierarchical adversarial attack taxonomy (character-, word-, and sentence-level) with comparative robustness evaluation against both standard adversarially trained and PIFE-augmented detectors
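The headline numbers above are TPR at a fixed 1% FPR, i.e. the detection rate at the most permissive score threshold that still misclassifies at most 1% of human-written text. This is standard ROC arithmetic rather than code from the paper; a minimal sketch, assuming higher detector scores mean "more likely AI-generated":

```python
def tpr_at_fpr(pos_scores, neg_scores, target_fpr=0.01):
    """True Positive Rate at the loosest threshold keeping FPR <= target_fpr.

    pos_scores: detector scores on AI-generated text.
    neg_scores: detector scores on human-written text.
    """
    neg_sorted = sorted(neg_scores, reverse=True)
    # Number of human texts we may allow above the threshold.
    allowed = int(target_fpr * len(neg_sorted))
    if allowed < len(neg_sorted):
        threshold = neg_sorted[allowed]
    else:
        threshold = min(neg_sorted) - 1.0  # budget admits every negative
    return sum(s > threshold for s in pos_scores) / len(pos_scores)
```

At a strict 1% FPR the threshold sits just below the top 1% of human-text scores, which is why adversarial paraphrasing, by pushing AI-text scores down into the human range, collapses the conventionally trained detector's TPR to 48.8%.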
🛡️ Threat Analysis
The paper systematically studies adversarial evasion attacks (character-, word-, and sentence-level perturbations, including paraphrasing) designed to cause AI-text detectors to misclassify at inference time. PIFE is explicitly proposed as a defense against these classifier-evasion attacks; adversarial robustness of the detector is a core contribution, not an incidental one.
The primary contribution is a novel AI-generated text detection framework (PIFE) that identifies whether text is LLM-produced, directly targeting output integrity and content provenance authentication.