Zhexin Zhang

h-index: 6 · 188 citations · 18 papers (total)

Papers in Database (2)

attack · arXiv · Nov 3, 2025

"Give a Positive Review Only": An Early Investigation Into In-Paper Prompt Injection Attacks and Defenses for AI Reviewers

Qin Zhou, Zhexin Zhang, Zhi Li et al. · Institute of Information Engineering · University of Chinese Academy of Sciences +1 more

Indirect prompt injection hidden inside academic papers hijacks LLM-based AI reviewers into awarding perfect scores

Prompt Injection · nlp
1 citation · PDF

benchmark · arXiv · Feb 4, 2026

The Missing Half: Unveiling Training-time Implicit Safety Risks Beyond Deployment

Zhexin Zhang, Yida Lu, Junfeng Fang et al. · Tsinghua University · National University of Singapore +1 more

First systematic taxonomy of training-time implicit safety risks in RL-trained LLMs, showing risky behaviors in 74.4% of runs

Model Skewing · Excessive Agency · nlp · reinforcement-learning
PDF