benchmark 2025

When Your Reviewer is an LLM: Biases, Divergence, and Prompt Injection Risks in Peer Review

Changjia Zhu 1, Junjie Xiong 2, Renkai Ma 3, Zhicong Lu 4, Yao Liu 1, Lingyao Li 1

0 citations

α

Published on arXiv

2509.09912

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Field-specific prompt injection instructions embedded in paper PDFs successfully manipulate specific aspects of LLM reviews, while general malicious prompts cause only minor topical shifts.


Peer review is the cornerstone of academic publishing, yet the process is increasingly strained by rising submission volumes, reviewer overload, and expertise mismatches. Large language models (LLMs) are now being used as "reviewer aids," raising concerns about their fairness, consistency, and robustness against indirect prompt injection attacks. This paper presents a systematic evaluation of LLMs as academic reviewers. Using a curated dataset of 1,441 papers from ICLR 2023 and NeurIPS 2022, we evaluate GPT-5-mini against human reviewers across ratings, strengths, and weaknesses. The evaluation employs structured prompting with reference paper calibration, topic modeling, and similarity analysis to compare review content. We further embed covert instructions into PDF submissions to assess LLMs' susceptibility to prompt injection. Our findings show that LLMs consistently inflate ratings for weaker papers while aligning more closely with human judgments on stronger contributions. Moreover, while overarching malicious prompts induce only minor shifts in topical focus, explicitly field-specific instructions successfully manipulate specific aspects of LLM-generated reviews. This study underscores both the promises and perils of integrating LLMs into peer review and points to the importance of designing safeguards that ensure integrity and trust in future review processes.


Key Contributions

  • Systematic empirical evaluation of LLM (GPT-5-mini) vs. human reviewer alignment on 1,441 ICLR 2023 and NeurIPS 2022 papers across ratings, strengths, and weaknesses
  • Demonstration that field-specific covert instructions embedded in PDF submissions successfully manipulate LLM-generated review content and ratings via indirect prompt injection
  • Finding that LLMs inflate ratings for weaker papers while aligning more closely with human judgments on stronger contributions

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
inference_timeblack_box
Datasets
ICLR 2023 reviewsNeurIPS 2022 reviews
Applications
academic peer reviewllm-assisted reviewing systems