Dana Arad

Papers in Database (1)

defense arXiv Aug 19, 2025 · Aug 2025

CRISP: Persistent Concept Unlearning via Sparse Autoencoders

Tomer Ashuach, Dana Arad, Aaron Mueller et al. · Technion – Israel Institute of Technology · Boston University +1 more

Permanently removes dangerous LLM knowledge by fine-tuning the model to suppress the relevant sparse autoencoder features, so the removal cannot be bypassed the way inference-time safety measures can.

Prompt Injection nlp