Paraphrasing Adversarial Attack on LLM-as-a-Reviewer
Published on arXiv
2601.06884
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
PAA consistently inflates LLM reviewer scores across all tested conferences and reviewer models without changing a paper's scientific claims. The adversarial paraphrases also transfer across different LLM reviewers, even when the target model is unknown.
PAA (Paraphrasing Adversarial Attack)
Novel technique introduced
The use of large language models (LLMs) in peer review systems has attracted growing attention, making it essential to examine their potential vulnerabilities. Prior attacks rely on prompt injection, which alters manuscript content and conflates injection susceptibility with evaluation robustness. We propose the Paraphrasing Adversarial Attack (PAA), a black-box optimization method that searches for paraphrased sequences yielding higher review scores while preserving semantic equivalence and linguistic naturalness. PAA leverages in-context learning, using previous paraphrases and their scores to guide candidate generation. Experiments across five ML and NLP conferences with three LLM reviewers and five attacking models show that PAA consistently increases review scores without changing the paper's claims. Human evaluation confirms that generated paraphrases maintain meaning and naturalness. We also find that attacked papers exhibit increased perplexity in reviews, offering a potential detection signal, and that paraphrasing submissions can partially mitigate attacks.
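The abstract describes PAA as a black-box iterative search guided by in-context learning over previously scored paraphrases. A minimal sketch of that loop is below; the `paraphrase` and `review_score` functions are hypothetical stand-ins for the attacking LLM and the LLM reviewer (the paper's actual prompting, scoring scale, and candidate-selection details are not specified here).

```python
import random

def paraphrase(abstract, history, rng):
    """Hypothetical attacking LLM: in PAA this would be an LLM prompted
    with past (paraphrase, score) pairs as in-context examples. Here it
    is a toy stand-in that appends a tagged variant marker."""
    return abstract + f" [variant {rng.randint(0, 9999)}]"

def review_score(text):
    """Hypothetical black-box LLM reviewer returning a numeric score;
    this toy scorer simply rewards longer text, capped at 10."""
    return min(10.0, len(text) / 100.0)

def paa_search(abstract, n_iters=20, seed=0):
    """Black-box iterative search in the spirit of PAA: generate a
    candidate paraphrase, query the reviewer for a score, keep the
    best-scoring text, and feed the scored history back to guide
    the next candidate."""
    rng = random.Random(seed)
    history = [(abstract, review_score(abstract))]
    best_text, best_score = history[0]
    for _ in range(n_iters):
        cand = paraphrase(best_text, history, rng)
        score = review_score(cand)
        history.append((cand, score))
        if score > best_score:
            best_text, best_score = cand, score
    return best_text, best_score
```

In the real attack, a semantic-equivalence and naturalness constraint would also filter candidates before scoring; that check is omitted from this sketch.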
Key Contributions
- PAA: an in-context-learning-guided black-box iterative search over paraphrased abstracts that maximizes LLM reviewer scores while preserving semantic equivalence and linguistic naturalness
- Comprehensive evaluation across 5 ML/NLP conferences, 3 LLM reviewers, and 5 attacking models, revealing self-preference bias and cross-model transferability of adversarial paraphrases
- Identification of increased review perplexity as a detection signal for PAA-attacked submissions, and defensive paraphrasing as a partial mitigation
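The detection signal above relies on measuring perplexity of the generated reviews. The paper presumably uses an LLM's token-level perplexity; the sketch below illustrates the idea with a toy unigram model under add-one smoothing, which is an assumption for demonstration only.

```python
import math
from collections import Counter

def unigram_perplexity(text, corpus_text):
    """Toy unigram perplexity of `text` under a model estimated from
    `corpus_text` with add-one (Laplace) smoothing. Illustrative only:
    an actual detector would use an LLM's token perplexity."""
    corpus = corpus_text.split()
    counts = Counter(corpus)
    vocab = len(counts) + 1  # reserve one slot for unseen tokens
    total = len(corpus)
    tokens = text.split()
    log_prob = 0.0
    for tok in tokens:
        p = (counts.get(tok, 0) + 1) / (total + vocab)
        log_prob += math.log(p)
    # Perplexity = exp of the average negative log-likelihood per token.
    return math.exp(-log_prob / max(len(tokens), 1))
```

Text that deviates from the reference distribution (as reviews of attacked papers apparently do) yields a higher perplexity, which is the property the detection signal exploits.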