Daling Wang

h-index: 24 · 2,384 citations · 217 papers (total)

Papers in Database (1)

defense · arXiv · Jan 15, 2026

Defending Large Language Models Against Jailbreak Attacks via In-Decoding Safety-Awareness Probing

Yinzhi Zhao, Ming Wang, Shi Feng et al. · Northeastern University

Defends LLMs against jailbreaks by probing latent safety signals during decoding to detect and block harmful outputs early.

Prompt Injection · nlp