Seonglae Cho

defense · arXiv · Aug 18, 2025

Seonglae Cho, Zekun Wu, Adriano Koshiyama · Holistic AI · University College London

Steers LLMs at inference time using correlated sparse autoencoder (SAE) features to prevent jailbreaks, improving HarmBench results by 27.2% with 108 samples

Prompt Injection · nlp

Papers in Database (1)