Darpan Aswal

Papers in Database (1)

defense · arXiv · Aug 22, 2025

ConceptGuard: Neuro-Symbolic Safety Guardrails via Sparse Interpretable Jailbreak Concepts

Darpan Aswal, Céline Hudelot · Université Paris-Saclay · CentraleSupélec

Defends LLMs against jailbreaks by using sparse autoencoders to identify interpretable concepts in internal activations that are linked to attack themes.

Prompt Injection nlp