Tsung-Yi Ho

h-index: 14 · 941 citations · 41 papers (total)

Papers in Database (1)

benchmark · arXiv · Feb 3, 2026

Steering Externalities: Benign Activation Steering Unintentionally Increases Jailbreak Risk for Large Language Models

Chen Xiong, Zhiyuan He, Pin-Yu Chen et al. · The Chinese University of Hong Kong · IBM Research

Reveals that benign activation steering vectors inadvertently erode LLM safety guardrails, pushing jailbreak success rates above 80%.

Prompt Injection · nlp