attack 2025

Publish to Perish: Prompt Injection Attacks on LLM-Assisted Peer Review

Matteo Gioele Collu ¹, Umberto Salviati ¹, Roberto Confalonieri ¹, Mauro Conti ^1,2, Giovanni Apruzzese ³

¹ University of Padua

² Örebro University

³ University of Liechtenstein

0 citations

Published on arXiv

2508.20863

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Adversarial prompts embedded invisibly in paper PDFs reliably mislead commercial LLMs acting as peer reviewers, with robustness demonstrated across multiple systems and reviewing prompt variants

Hidden Prompt Injection

Novel technique introduced

Large Language Models (LLMs) are increasingly being integrated into the scientific peer-review process, raising new questions about their reliability and resilience to manipulation. In this work, we investigate the potential for hidden prompt injection attacks, where authors embed adversarial text within a paper's PDF to influence the LLM-generated review. We begin by formalising three distinct threat models that envision attackers with different motivations -- not all of which implying malicious intent. For each threat model, we design adversarial prompts that remain invisible to human readers yet can steer an LLM's output toward the author's desired outcome. Using a user study with domain scholars, we derive four representative reviewing prompts used to elicit peer reviews from LLMs. We then evaluate the robustness of our adversarial prompts across (i) different reviewing prompts, (ii) different commercial LLM-based systems, and (iii) different peer-reviewed papers. Our results show that adversarial prompts can reliably mislead the LLM, sometimes in ways that adversely affect a "honest-but-lazy" reviewer. Finally, we propose and empirically assess methods to reduce detectability of adversarial prompts under automated content checks.

Key Contributions

Formalizes three distinct threat models for prompt injection in LLM-assisted peer review, covering attackers with varying motivations and knowledge levels
Designs adversarial prompts invisible to human readers that reliably steer LLM review outputs across different commercial systems, reviewing prompt styles, and paper content
Proposes and empirically evaluates evasion methods to reduce adversarial prompt detectability under automated content-check defenses

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm

Threat Tags

black_boxinference_timetargeteddigital

Applications

llm-assisted peer reviewscientific paper evaluation

Read PDF arXiv

Publish to Perish: Prompt Injection Attacks on LLM-Assisted Peer Review

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

A Whole New World: Creating a Parallel-Poisoned Web Only AI-Agents Can See

Automating Agent Hijacking via Structural Template Injection

Multilingual Hidden Prompt Injection Attacks on LLM-Based Academic Reviewing

Self-HarmLLM: Can Large Language Model Harm Itself?

Are All Prompt Components Value-Neutral? Understanding the Heterogeneous Adversarial Robustness of Dissected Prompt in Large Language Models

AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models

Dynamics of Adversarial Attacks on Large Language Model-Based Search Engines

The Echo Chamber Multi-Turn LLM Jailbreak