Zhiyuan He

h-index: 1 24 citations 2 papers (total)

Papers in Database (1)

benchmark arXiv Feb 3, 2026 · 8w ago

Steering Externalities: Benign Activation Steering Unintentionally Increases Jailbreak Risk for Large Language Models

Chen Xiong, Zhiyuan He, Pin-Yu Chen et al. · The Chinese University of Hong Kong · IBM Research

Reveals that benign activation steering vectors inadvertently erode LLM safety guardrails, amplifying jailbreak success rates past 80%

Prompt Injection nlp
PDF