When Reject Turns into Accept: Quantifying the Vulnerability of LLM-Based Scientific Reviewers to Indirect Prompt Injection
Devanshu Sahoo 1, Manish Prasad 1, Vasudev Majhi 1, Jahnvi Singh 1, Vinay Chamola 1, Yash Sinha 1, Murari Mandal 2, Dhruv Kumar 1
Published on arXiv
2512.10449
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Obfuscation attacks "Maximum Mark Magyk" and "Symbolic Masking & Context Redirection" achieve Reject-to-Accept decision flip rates of up to 86.26% on open-source LLM reviewers, while exposing distinct reasoning traps in proprietary models.
WAVS
Novel technique introduced
Driven by surging submission volumes, scientific peer review has catalyzed two parallel trends: individual over-reliance on LLMs and institutional AI-powered assessment systems. This study investigates the robustness of "LLM-as-a-Judge" systems to adversarial PDF manipulation via invisible text injections and layout-aware encoding attacks. We specifically target the distinct incentive of flipping "Reject" decisions to "Accept," a vulnerability that fundamentally compromises scientific integrity. To measure this, we introduce the Weighted Adversarial Vulnerability Score (WAVS), a novel metric that quantifies susceptibility by weighting score inflation against the severity of decision shifts relative to ground truth. We adapt 15 domain-specific attack strategies, ranging from semantic persuasion to cognitive obfuscation, and evaluate them across 13 diverse language models (including GPT-5 and DeepSeek) using a curated dataset of 200 real-world accepted and rejected submissions (e.g., from ICLR OpenReview). Our results demonstrate that obfuscation techniques like "Maximum Mark Magyk" and "Symbolic Masking & Context Redirection" successfully manipulate scores, achieving decision flip rates of up to 86.26% in open-source models, while exposing distinct "reasoning traps" in proprietary systems. We release our complete dataset and injection framework to facilitate further research on the topic (https://anonymous.4open.science/r/llm-jailbreak-FC9E/).
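The abstract describes WAVS as weighting score inflation against the severity of decision shifts relative to ground truth. The paper's exact formula is not reproduced here; the sketch below is an illustrative metric in that spirit, where the severity table, the 0-10 score scale, and the `alpha` mixing weight are all assumptions, not the authors' definition.

```python
# Illustrative WAVS-style metric: combines normalized score inflation with a
# severity weight for the decision shift. All constants are assumptions.

# Severity of the post-attack decision, conditioned on the ground-truth label.
# A Reject paper flipped to Accept is the worst case for scientific integrity.
SEVERITY = {
    ("Reject", "Accept"): 1.0,  # full Reject-to-Accept flip
    ("Reject", "Reject"): 0.0,  # decision held despite the attack
    ("Accept", "Accept"): 0.0,  # no integrity violation
    ("Accept", "Reject"): 0.0,  # downgrade: not the attacker's incentive
}

def wavs(pre_score: float, post_score: float,
         ground_truth: str, post_decision: str,
         score_scale: float = 10.0, alpha: float = 0.5) -> float:
    """Weighted vulnerability score in [0, 1] for one attacked review."""
    # Only upward score movement counts as inflation.
    inflation = max(post_score - pre_score, 0.0) / score_scale
    severity = SEVERITY[(ground_truth, post_decision)]
    # Blend the two components; alpha trades inflation against flip severity.
    return alpha * inflation + (1.0 - alpha) * severity
```

Under this toy weighting, a ground-truth Reject whose score is pushed from 3 to 8 and whose decision flips to Accept scores far higher than one whose score merely creeps from 3 to 4 without a flip.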
Key Contributions
- Introduces WAVS (Weighted Adversarial Vulnerability Score), a metric weighting score inflation against decision-shift severity relative to ground truth
- Adapts 15 domain-specific indirect prompt injection strategies (invisible text, font-level encoding, layout manipulation, cognitive obfuscation) targeting scientific peer review LLM judges
- Evaluates 13 LLMs (GPT-5, DeepSeek, Claude Haiku, etc.) on 200 ICLR OpenReview papers, demonstrating up to 86.26% Reject-to-Accept flip rates in open-source models
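The contributions mention invisible-text injection as one of the adapted attack strategies. One well-known way to hide text in a PDF is text rendering mode 3 ("neither fill nor stroke" per the PDF specification), which leaves the string in the content stream, where text extractors feeding an LLM reviewer will pick it up, while painting nothing on the page. The sketch below builds such a content-stream fragment; it is a minimal illustration of this class of attack, not the paper's injection framework, and the font name `/F1` is a placeholder.

```python
def invisible_text_fragment(payload: str) -> bytes:
    """Build a PDF content-stream fragment that 'draws' `payload` invisibly.

    Rendering mode 3 means the glyphs are neither filled nor stroked, so the
    string never appears visually but remains extractable as page text.
    """
    # Escape the characters that delimit a PDF literal string.
    escaped = (payload.replace("\\", r"\\")
                      .replace("(", r"\(")
                      .replace(")", r"\)"))
    ops = (
        "BT\n"               # begin text object
        "/F1 1 Tf\n"         # select a font (placeholder name) at size 1
        "3 Tr\n"             # text rendering mode 3 = invisible
        "1 0 0 1 0 0 Tm\n"   # text matrix: place at the page origin
        f"({escaped}) Tj\n"  # show the injected string
        "ET"                 # end text object
    )
    return ops.encode("latin-1")
```

A defender-side takeaway from the same sketch: scanning content streams for `3 Tr` (or for text drawn in the page's background color) is a cheap first-pass check before handing a submission to an automated reviewer.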