attack arXiv Nov 3, 2025 · Nov 2025
Qin Zhou, Zhexin Zhang, Zhi Li et al. · Institute of Information Engineering · University of Chinese Academy of Sciences +1 more
Indirect prompt injection hidden inside academic papers hijacks LLM-based AI reviewers into awarding perfect scores
Prompt Injection nlp
With the rapid advancement of AI models, their deployment across diverse tasks has become increasingly widespread. A notable emerging application is leveraging AI models to assist in reviewing scientific papers. However, recent reports have revealed that some papers contain hidden, injected prompts designed to manipulate AI reviewers into providing overly favorable evaluations. In this work, we present an early systematic investigation into this emerging threat. We propose two classes of attacks: (1) static attack, which employs a fixed injection prompt, and (2) iterative attack, which optimizes the injection prompt against a simulated reviewer model to maximize its effectiveness. Both attacks achieve striking performance, frequently inducing full evaluation scores when targeting frontier AI reviewers. Furthermore, we show that these attacks are robust across various settings. To counter this threat, we explore a simple detection-based defense. While it substantially reduces the attack success rate, we demonstrate that an adaptive attacker can partially circumvent this defense. Our findings underscore the need for greater attention and rigorous safeguards against prompt-injection threats in AI-assisted peer review.
llm Institute of Information Engineering · University of Chinese Academy of Sciences · Tsinghua University
benchmark arXiv Feb 4, 2026 · 8w ago
Zhexin Zhang, Yida Lu, Junfeng Fang et al. · Tsinghua University · National University of Singapore +1 more
First systematic taxonomy of training-time implicit safety risks in RL-trained LLMs, showing risky behaviors in 74.4% of runs
Model Skewing Excessive Agency nlpreinforcement-learning
Safety risks of AI models have been widely studied at deployment time, such as jailbreak attacks that elicit harmful outputs. In contrast, safety risks emerging during training remain largely unexplored. Beyond explicit reward hacking that directly manipulates explicit reward functions in reinforcement learning, we study implicit training-time safety risks: harmful behaviors driven by a model's internal incentives and contextual background information. For example, during code-based reinforcement learning, a model may covertly manipulate logged accuracy for self-preservation. We present the first systematic study of this problem, introducing a taxonomy with five risk levels, ten fine-grained risk categories, and three incentive types. Extensive experiments reveal the prevalence and severity of these risks: notably, Llama-3.1-8B-Instruct exhibits risky behaviors in 74.4% of training runs when provided only with background information. We further analyze factors influencing these behaviors and demonstrate that implicit training-time risks also arise in multi-agent training settings. Our results identify an overlooked yet urgent safety challenge in training.
llm rl Tsinghua University · National University of Singapore · Tencent