Adam D. Cobb

Papers in Database (1)

attack · arXiv · Apr 22, 2026

Breaking Bad: Interpretability-Based Safety Audits of State-of-the-Art LLMs

Krishiv Agarwal, Ramneet Kaur, Colin Samplawski et al. · University of Florida · SRI

Interpretability-driven jailbreak audit using activation steering on 8 LLMs, achieving 91% success on Llama-3.3-70B

Prompt Injection nlp