defense 2025

DAMASHA: Detecting AI in Mixed Adversarial Texts via Segmentation with Human-interpretable Attribution

L. D. M. S. Sai Teja 1, N. Siva Gopala Krishna 2, Ufaq Khan 3, Muhammad Haris Khan 3, Atul Mishra 2

0 citations · 45 references · arXiv


Published on arXiv · 2512.04838

Output Integrity Attack (OWASP ML Top 10: ML09)

Input Manipulation Attack (OWASP ML Top 10: ML01)

Key Finding

Info-Mask significantly improves span-level robustness under adversarial conditions across multiple architectures, establishing new baselines for mixed human-AI authorship detection under adversarial perturbation

Info-Mask

Novel technique introduced


In the age of advanced large language models (LLMs), the boundaries between human and AI-generated text are becoming increasingly blurred. We address the challenge of segmenting mixed-authorship text, that is, identifying transition points where authorship shifts from human to AI or vice versa, a problem with critical implications for authenticity, trust, and human oversight. We introduce a novel framework, called Info-Mask, for mixed-authorship detection that integrates stylometric cues, perplexity-driven signals, and structured boundary modeling to accurately segment collaborative human-AI content. To evaluate the robustness of our system against adversarial perturbations, we construct and release an adversarial benchmark dataset, Mixed-text Adversarial setting for Segmentation (MAS), designed to probe the limits of existing detectors. Beyond segmentation accuracy, we introduce Human-Interpretable Attribution (HIA) overlays that highlight how stylometric features inform boundary predictions, and we conduct a small-scale human study assessing their usefulness. Across multiple architectures, Info-Mask significantly improves span-level robustness under adversarial conditions, establishing new baselines while revealing remaining challenges. Our findings highlight both the promise and limitations of adversarially robust, interpretable mixed-authorship detection, with implications for trust and oversight in human-AI co-authorship.
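To make the segmentation idea concrete, here is a minimal sketch of boundary detection from per-sentence stylometric features. Everything below is illustrative: the paper combines stylometric cues with perplexity signals and structured boundary modeling, whereas this toy uses three hand-picked features (average word length, type-token ratio, punctuation rate) and a simple change-point threshold that is not from the paper.

```python
# Toy change-point sketch over per-sentence stylometric features.
# Feature choices and the distance threshold are assumptions, not
# the paper's actual Info-Mask design.
import re

def stylometric_features(sentence):
    """Return (avg word length, type-token ratio, punctuation rate)."""
    words = re.findall(r"[A-Za-z']+", sentence)
    if not words:
        return (0.0, 0.0, 0.0)
    avg_len = sum(len(w) for w in words) / len(words)
    ttr = len({w.lower() for w in words}) / len(words)
    punct = sum(c in ",;:()\"'" for c in sentence) / max(len(sentence), 1)
    return (avg_len, ttr, punct)

def boundary_candidates(sentences, threshold=0.6):
    """Flag sentence indices where the style vector jumps sharply,
    i.e. where authorship may shift before that sentence."""
    feats = [stylometric_features(s) for s in sentences]
    boundaries = []
    for i in range(1, len(feats)):
        dist = sum(abs(a - b) for a, b in zip(feats[i - 1], feats[i]))
        if dist > threshold:
            boundaries.append(i)
    return boundaries
```

A real system would replace the hand-built features with model-derived perplexity signals and learn the boundary decision rather than thresholding a raw distance.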


Key Contributions

  • Info-Mask framework integrating stylometric cues, perplexity-driven signals, and structured boundary modeling for adversarially robust mixed-authorship text segmentation
  • MAS adversarial benchmark dataset for evaluating AI text detector robustness against NLP-level evasion attacks (word/character substitution, paraphrasing)
  • Human-Interpretable Attribution (HIA) overlays that surface stylometric features driving boundary predictions, validated in a human study

🛡️ Threat Analysis

Input Manipulation Attack

Constructs the MAS adversarial benchmark using text-level evasion attacks (word substitution and paraphrasing via BAE, HotFlip, etc.) that target the AI content detector at inference time; Info-Mask explicitly defends against these adversarial perturbations.
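For intuition about this attack class, the sketch below applies a trivial character-level evasion. It is not one of the paper's attacks: BAE and HotFlip are model-guided word- and gradient-level attacks, while this toy merely swaps a few ASCII letters for visually similar Cyrillic homoglyphs, a perturbation known to confuse tokenizer-based detectors.

```python
# Illustrative (non-paper) evasion: replace some ASCII letters with
# look-alike Cyrillic characters so the text reads the same to humans
# but tokenizes differently for a detector.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}

def perturb(text, rate=0.3):
    """Swap up to rate * len(text) characters for homoglyphs."""
    budget = int(rate * len(text))
    out = []
    for ch in text:
        if budget > 0 and ch in HOMOGLYPHS:
            out.append(HOMOGLYPHS[ch])
            budget -= 1
        else:
            out.append(ch)
    return "".join(out)
```

A robustness benchmark like MAS evaluates whether a detector's boundary predictions survive such perturbations rather than flipping on superficially altered input.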

Output Integrity Attack

Primary contribution is a novel framework (Info-Mask) for detecting AI-generated content in mixed-authorship text, directly targeting output integrity and AI content authenticity/provenance.


Details

Domains: nlp
Model Types: llm, transformer
Threat Tags: inference_time, digital, black_box
Datasets: MAS, CoAuthor, MixSet, RoFT-chatgpt
Applications: ai-generated text detection, mixed authorship segmentation, academic integrity, content authenticity