Yuxiao Li

Papers in Database (1)

benchmark arXiv Mar 25, 2026

Analysing the Safety Pitfalls of Steering Vectors

Yuxiao Li, Alina Fastowski, Efstratios Zaradoukas et al. · Technical University of Munich

Activation steering vectors systematically erode LLM safety alignment, increasing jailbreak success rates up to 57% by interfering with refusal behavior directions.

Prompt Injection nlp