Yingjie Zhang

Papers in Database (1)

defense arXiv Sep 18, 2025 · Sep 2025

Beyond Surface Alignment: Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction

Yuanbo Xie, Yingjie Zhang, Tianyun Liu et al.

Defends LLMs against jailbreaks by probabilistically ablating refusal directions during fine-tuning, forcing models to rebuild safety from adversarial states.

Prompt Injection nlp