Xianpei Han

h-index: 26 · 3,832 citations · 132 papers (total)

Papers in Database (1)

defense · arXiv · Oct 24, 2025

When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models

Yingzhi Mao, Chunkang Zhang, Junxiang Wang et al. · Institute of Software · University of Chinese Academy of Sciences

Discovers Self-Jailbreak in LRMs, where models override their own safety judgments mid-reasoning, and defends against it with step-level trajectory training

Prompt Injection · NLP