benchmark 2025

Prompt Injection Attacks on LLM Generated Reviews of Scientific Publications

Janis Keuper

0 citations

α

Published on arXiv

2509.10248

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Hidden prompt injections in paper PDFs achieve up to 100% LLM acceptance scores, while LLMs are also generally biased toward acceptance (>95%) even without injections


The ongoing intense discussion on rising LLM usage in the scientific peer-review process has recently been mingled by reports of authors using hidden prompt injections to manipulate review scores. Since the existence of such "attacks" - although seen by some commentators as "self-defense" - would have a great impact on the further debate, this paper investigates the practicability and technical success of the described manipulations. Our systematic evaluation uses 1k reviews of 2024 ICLR papers generated by a wide range of LLMs shows two distinct results: I) very simple prompt injections are indeed highly effective, reaching up to 100% acceptance scores. II) LLM reviews are generally biased toward acceptance (>95% in many models). Both results have great impact on the ongoing discussions on LLM usage in peer-review.


Key Contributions

  • First systematic empirical evaluation of the practical effectiveness of hidden prompt injection attacks on LLM-generated scientific reviews
  • Demonstrates that simple prompt injections (hidden text) achieve up to 100% acceptance scores across 1k ICLR 2024 papers reviewed by multiple LLMs
  • Reveals that LLM reviews are generally biased toward acceptance (>95% in many models) independent of prompt injection

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_boxinference_timetargeted
Datasets
ICLR 2024 OpenReview
Applications
scientific peer reviewdocument review