Darpan Aswal

defense · arXiv · Aug 22, 2025

Darpan Aswal, Céline Hudelot · Université Paris-Saclay · CentraleSupélec

Defends LLMs against jailbreaks by using sparse autoencoders to identify interpretable concepts in internal activations that are linked to attack themes.

Prompt Injection nlp

Papers in Database (1)