Ching-Yun Ko

h-index: 5 · 105 citations · 12 papers (total)

Papers in Database (1)

benchmark · arXiv · Feb 3, 2026

Steering Externalities: Benign Activation Steering Unintentionally Increases Jailbreak Risk for Large Language Models

Chen Xiong, Zhiyuan He, Pin-Yu Chen et al. · The Chinese University of Hong Kong · IBM Research

Reveals that benign activation steering vectors inadvertently erode LLM safety guardrails, amplifying jailbreak success rates past 80%

Prompt Injection · nlp