defense 2026

DoPE: Decoy Oriented Perturbation Encapsulation Human-Readable, AI-Hostile Documents for Academic Integrity

Ashish Raj Shekhar , Shiven Agarwal , Priyanuj Bordoloi , Yash Shah , Tejas Anvekar , Vivek Gupta

Arizona State University

0 citations · 32 references · arXiv

Published on arXiv

2601.12505

Output Integrity Attack

OWASP ML Top 10 — ML09

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Against black-box OpenAI and Anthropic MLLMs, DoPE achieves 91.4% detection at 8.7% FPR and prevents successful completion or induces decoy-aligned failures in 96.3% of attempts

DoPE (Decoy-Oriented Perturbation Encapsulation)

Novel technique introduced

Multimodal Large Language Models (MLLMs) can directly consume exam documents, threatening conventional assessments and academic integrity. We present DoPE (Decoy-Oriented Perturbation Encapsulation), a document-layer defense framework that embeds semantic decoys into PDF/HTML assessments to exploit render-parse discrepancies in MLLM pipelines. By instrumenting exams at authoring time, DoPE provides model-agnostic prevention (stop or confound automated solving) and detection (flag blind AI reliance) without relying on conventional one-shot classifiers. We formalize prevention and detection tasks, and introduce FewSoRT-Q, an LLM-guided pipeline that generates question-level semantic decoys and FewSoRT-D to encapsulate them into watermarked documents. We evaluate on Integrity-Bench, a novel benchmark of 1826 exams (PDF+HTML) derived from public QA datasets and OpenCourseWare. Against black-box MLLMs from OpenAI and Anthropic, DoPE yields strong empirical gains: a 91.4% detection rate at an 8.7% false-positive rate using an LLM-as-Judge verifier, and prevents successful completion or induces decoy-aligned failures in 96.3% of attempts. We release Integrity-Bench, our toolkit, and evaluation code to enable reproducible study of document-layer defenses for academic integrity.

Key Contributions

DoPE framework that embeds semantic decoys into PDF/HTML exam documents by exploiting render-parse discrepancies in MLLM pipelines, achieving 96.3% prevention of successful MLLM-assisted completion
FewSoRT-Q/D pipeline: LLM-guided generation of question-level semantic decoys (FewSoRT-Q) and their encapsulation into visually unchanged, watermarked documents (FewSoRT-D)
Integrity-Bench: a novel benchmark of 1,826 paired PDF+HTML exams with multiple watermarked variants, enabling controlled evaluation of document-layer defenses against black-box MLLMs

🛡️ Threat Analysis

Output Integrity Attack

The detection component watermarks exam documents with semantic decoys to identify when MLLMs are used to solve exams — verifying content integrity via an LLM-as-Judge verifier that flags decoy-aligned answers as evidence of AI reliance.

Details

Domains

nlpmultimodal

Model Types

llmvlm

Threat Tags

black_boxinference_time

Datasets

Integrity-Bench

Applications

academic assessment integrityai-assisted cheating detectionexam document protection

Read PDF arXiv DOI

DoPE: Decoy Oriented Perturbation Encapsulation Human-Readable, AI-Hostile Documents for Academic Integrity

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

ICON: Indirect Prompt Injection Defense for Agents based on Inference-Time Correction

AM$^3$Safety: Towards Data Efficient Alignment of Multi-modal Multi-turn Safety for MLLMs

Clouding the Mirror: Stealthy Prompt Injection Attacks Targeting LLM-based Phishing Detection

Multi-turn Jailbreaking Attack in Multi-Modal Large Language Models

ARGUS: Defending Against Multimodal Indirect Prompt Injection via Steering Instruction-Following Behavior

Countermind: A Multi-Layered Security Architecture for Large Language Models

AegisAgent: An Autonomous Defense Agent Against Prompt Injection Attacks in LLM-HARs

Risk Assessment and Security Analysis of Large Language Models