When Reject Turns into Accept: Quantifying the Vulnerability of LLM-Based Scientific Reviewers to Indirect Prompt Injection
Devanshu Sahoo 1, Manish Prasad 1, Vasudev Majhi 1, Jahnvi Singh 1, Vinay Chamola 1, Yash Sinha 1, Murari Mandal 2, Dhruv Kumar 1
Published on arXiv
2512.10449
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Obfuscation attacks "Maximum Mark Magyk" and "Symbolic Masking & Context Redirection" achieve Reject-to-Accept decision flip rates of up to 86.26% on open-source LLM reviewers, while exposing distinct reasoning traps in proprietary models.
WAVS
Novel technique introduced
Driven by surging submission volumes, scientific peer review has catalyzed two parallel trends: individual over-reliance on LLMs and institutional AI-powered assessment systems. This study investigates the robustness of "LLM-as-a-Judge" systems to adversarial PDF manipulation via invisible text injections and layout-aware encoding attacks. We specifically target the distinct incentive of flipping "Reject" decisions to "Accept," a vulnerability that fundamentally compromises scientific integrity. To measure this, we introduce the Weighted Adversarial Vulnerability Score (WAVS), a novel metric that quantifies susceptibility by weighting score inflation against the severity of decision shifts relative to ground truth. We adapt 15 domain-specific attack strategies, ranging from semantic persuasion to cognitive obfuscation, and evaluate them across 13 diverse language models (including GPT-5 and DeepSeek) using a curated dataset of 200 real-world accepted and rejected submissions (e.g., from ICLR OpenReview). Our results demonstrate that obfuscation techniques like "Maximum Mark Magyk" and "Symbolic Masking & Context Redirection" successfully manipulate scores, achieving decision flip rates of up to 86.26% in open-source models, while exposing distinct "reasoning traps" in proprietary systems. We release our complete dataset and injection framework to facilitate further research on the topic (https://anonymous.4open.science/r/llm-jailbreak-FC9E/).
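The abstract describes WAVS as weighting score inflation against the severity of decision shifts relative to ground truth. The paper's exact formula is not reproduced here; the sketch below is an illustrative metric in that spirit, where the severity table, the 0-10 score scale, and the `alpha` mixing weight are all assumptions, not the authors' definition.

```python
# Illustrative WAVS-style metric: combines normalized score inflation with a
# severity weight for the decision shift. All constants are assumptions.

# Severity of the post-attack decision, conditioned on the ground-truth label.
# A Reject paper flipped to Accept is the worst case for scientific integrity.
SEVERITY = {
    ("Reject", "Accept"): 1.0,  # full Reject-to-Accept flip
    ("Reject", "Reject"): 0.0,  # decision held despite the attack
    ("Accept", "Accept"): 0.0,  # no integrity violation
    ("Accept", "Reject"): 0.0,  # downgrade: not the attacker's incentive
}

def wavs(pre_score: float, post_score: float,
         ground_truth: str, post_decision: str,
         score_scale: float = 10.0, alpha: float = 0.5) -> float:
    """Weighted vulnerability score in [0, 1] for one attacked review."""
    # Only upward score movement counts as inflation.
    inflation = max(post_score - pre_score, 0.0) / score_scale
    severity = SEVERITY[(ground_truth, post_decision)]
    # Blend the two components; alpha trades inflation against flip severity.
    return alpha * inflation + (1.0 - alpha) * severity
```

Under this toy weighting, a ground-truth Reject whose score is pushed from 3 to 8 and whose decision flips to Accept scores far higher than one whose score merely creeps from 3 to 4 without a flip.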
Key Contributions
- Introduces WAVS (Weighted Adversarial Vulnerability Score), a metric weighting score inflation against decision-shift severity relative to ground truth
- Adapts 15 domain-specific indirect prompt injection strategies (invisible text, font-level encoding, layout manipulation, cognitive obfuscation) targeting scientific peer review LLM judges
- Evaluates 13 LLMs (GPT-5, DeepSeek, Claude Haiku, etc.) on 200 ICLR OpenReview papers, demonstrating up to 86.26% Reject-to-Accept flip rates in open-source models
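The contributions mention invisible-text injection as one of the adapted attack strategies. One well-known way to hide text in a PDF is text rendering mode 3 ("neither fill nor stroke" per the PDF specification), which leaves the string in the content stream, where text extractors feeding an LLM reviewer will pick it up, while painting nothing on the page. The sketch below builds such a content-stream fragment; it is a minimal illustration of this class of attack, not the paper's injection framework, and the font name `/F1` is a placeholder.

```python
def invisible_text_fragment(payload: str) -> bytes:
    """Build a PDF content-stream fragment that 'draws' `payload` invisibly.

    Rendering mode 3 means the glyphs are neither filled nor stroked, so the
    string never appears visually but remains extractable as page text.
    """
    # Escape the characters that delimit a PDF literal string.
    escaped = (payload.replace("\\", r"\\")
                      .replace("(", r"\(")
                      .replace(")", r"\)"))
    ops = (
        "BT\n"               # begin text object
        "/F1 1 Tf\n"         # select a font (placeholder name) at size 1
        "3 Tr\n"             # text rendering mode 3 = invisible
        "1 0 0 1 0 0 Tm\n"   # text matrix: place at the page origin
        f"({escaped}) Tj\n"  # show the injected string
        "ET"                 # end text object
    )
    return ops.encode("latin-1")
```

A defender-side takeaway from the same sketch: scanning content streams for `3 Tr` (or for text drawn in the page's background color) is a cheap first-pass check before handing a submission to an automated reviewer.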