Jacopo Cortellazzi

h-index: 0 · 0 citations · 0 papers (total)

Papers in Database (1)

defense · arXiv · Feb 12, 2026 · 7w ago

Sparse Autoencoders are Capable LLM Jailbreak Mitigators

Yannick Assogba, Jacopo Cortellazzi, Javier Abad et al. · Apple · ETH Zürich

Defends LLMs against jailbreaks via SAE feature-space steering, outperforming dense activation steering on four models across twelve attacks

Prompt Injection · nlp