tool 2026

Integrity Shield A System for Ethical AI Use & Authorship Transparency in Assessments

Ashish Raj Shekhar , Shiven Agarwal , Priyanuj Bordoloi , Yash Shah , Tejas Anvekar , Vivek Gupta

0 citations · 16 references · arXiv

α

Published on arXiv

2601.11093

Output Integrity Attack

OWASP ML Top 10 — ML09

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Achieves 91-94% exam-level MLLM blocking and 89-93% authorship signature retrieval across four commercial MLLMs on 30 diverse exams

Integrity Shield

Novel technique introduced


Large Language Models (LLMs) can now solve entire exams directly from uploaded PDF assessments, raising urgent concerns about academic integrity and the reliability of grades and credentials. Existing watermarking techniques either operate at the token level or assume control over the model's decoding process, making them ineffective when students query proprietary black-box systems with instructor-provided documents. We present Integrity Shield, a document-layer watermarking system that embeds schema-aware, item-level watermarks into assessment PDFs while keeping their human-visible appearance unchanged. These watermarks consistently prevent MLLMs from answering shielded exam PDFs and encode stable, item-level signatures that can be reliably recovered from model or student responses. Across 30 exams spanning STEM, humanities, and medical reasoning, Integrity Shield achieves exceptionally high prevention (91-94% exam-level blocking) and strong detection reliability (89-93% signature retrieval) across four commercial MLLMs. Our demo showcases an interactive interface where instructors upload an exam, preview watermark behavior, and inspect pre/post AI performance & authorship evidence.


Key Contributions

  • Document-layer watermarking system embedding schema-aware, invisible watermarks into exam PDFs that simultaneously prevent MLLM cheating (91-94% blocking) and enable authorship detection (89-93% signature retrieval) across four commercial MLLMs
  • LLM-driven strategy planner that adapts watermark tactics to question schema (MCQ, true/false, long-form) without altering human-visible content
  • Interactive instructor-facing demo for exam upload, watermark preview, and pre/post AI performance and authorship evidence reporting

🛡️ Threat Analysis

Input Manipulation Attack

Prevention mode uses document-layer adversarial perturbations (invisible text, glyph remappings, overlays) that exploit the render-parse gap to cause MLLMs to fail at inference time, achieving 91-94% exam-level blocking across black-box commercial MLLMs.

Output Integrity Attack

Core contribution is watermarking exam PDFs with recoverable signatures that propagate to model outputs, enabling AI authorship detection and attribution — 89-93% signature retrieval from MLLM responses across 30 exams.


Details

Domains
nlpmultimodal
Model Types
vlmllm
Threat Tags
inference_timeblack_box
Datasets
30 custom exam PDFs (STEM, humanities, medical reasoning)
Applications
academic integrityexam assessmentai-generated content detection