Defense · 2025

CRISP: Persistent Concept Unlearning via Sparse Autoencoders

Tomer Ashuach¹, Dana Arad¹, Aaron Mueller², Martin Tutek³, Yonatan Belinkov¹



Published on arXiv: 2508.13650

Threat Category
Prompt Injection (OWASP LLM Top 10 — LLM01)

Key Finding

Outperforms prior unlearning methods by 5–34 points on the WMDP safety benchmark while preserving benign knowledge and generation fluency across two LLMs

CRISP

Novel technique introduced


As large language models (LLMs) are increasingly deployed in real-world applications, the need to selectively remove unwanted knowledge while preserving model utility has become paramount. Recent work has explored sparse autoencoders (SAEs) to perform precise interventions on monosemantic features. However, most SAE-based methods operate at inference time, which does not create persistent changes in the model's parameters. Such interventions can be bypassed or reversed by malicious actors with parameter access. We introduce CRISP, a parameter-efficient method for persistent concept unlearning using SAEs. CRISP automatically identifies salient SAE features across multiple layers and suppresses their activations. We experiment with two LLMs and show that our method outperforms prior approaches on safety-critical unlearning tasks from the WMDP benchmark, successfully removing harmful knowledge while preserving general and in-domain capabilities. Feature-level analysis reveals that CRISP achieves semantically coherent separation between target and benign concepts, allowing precise suppression of the target features.
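The abstract's first step, automatically identifying SAE features salient to the target concept, is contrastive: features should fire on the harmful corpus but stay quiet on benign text. A minimal sketch, assuming SAE activations have already been extracted as token-by-feature matrices; `salient_features` and its scoring rule are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def salient_features(acts_target, acts_benign, top_k=20):
    """Rank SAE features by how much more strongly they activate on the
    target (harmful) corpus than on a benign corpus.

    acts_target, acts_benign: (num_tokens, num_features) SAE activations.
    Returns indices of the top_k most target-salient features.
    """
    mean_target = acts_target.mean(axis=0)
    mean_benign = acts_benign.mean(axis=0)
    # Contrastive saliency: high when a feature fires on target text
    # but not on benign text.
    score = mean_target - mean_benign
    return np.argsort(score)[::-1][:top_k]

# Toy example: feature 3 fires only on the "target" corpus.
rng = np.random.default_rng(0)
acts_t = np.abs(rng.normal(0.1, 0.05, size=(100, 8)))
acts_b = np.abs(rng.normal(0.1, 0.05, size=(100, 8)))
acts_t[:, 3] += 1.0
print(salient_features(acts_t, acts_b, top_k=1))  # → [3]
```

In the paper this selection runs across multiple layers, so the returned indices would be per-layer feature sets rather than a single list.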


Key Contributions

  • Automated contrastive activation analysis pipeline that identifies SAE features salient to a target concept while discriminating from benign concepts
  • CRISP: a parameter-efficient persistent unlearning method that suppresses identified SAE features via LoRA fine-tuning, making safety interventions robust to parameter-level bypass
  • Feature-level analysis demonstrating semantically coherent separation between target and benign concept directions in the SAE feature space
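A training objective matching the second bullet, suppressing the identified SAE features while keeping benign behavior close to the original model, could be sketched as below. The squared-activation suppression term and the KL retain term are assumed forms for illustration; the paper's exact loss and LoRA setup may differ, and in practice only the LoRA adapter weights would be updated against this objective:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def crisp_style_loss(sae_acts, feature_ids, logits, ref_logits, lam=1.0):
    """Sketch of a persistent-unlearning objective (assumed form):
    drive the salient SAE feature activations toward zero, while a KL
    term keeps benign-data predictions close to the frozen reference
    model. Minimizing this via parameter updates (e.g. LoRA) makes the
    suppression persistent rather than an inference-time intervention.
    """
    # Suppression: penalize any activation of the target features.
    suppress = np.mean(sae_acts[:, feature_ids] ** 2)
    # Retention: KL(reference || current) on benign inputs.
    p_ref = softmax(ref_logits)
    p_new = softmax(logits)
    retain = np.mean(np.sum(p_ref * (np.log(p_ref) - np.log(p_new)), axis=-1))
    return suppress + lam * retain

# Target features silent + benign predictions unchanged → zero loss.
acts = np.zeros((4, 8))
logits = np.ones((4, 5))
print(crisp_style_loss(acts, [2, 5], logits, logits))  # → 0.0
```

The trade-off parameter `lam` (a name assumed here) would control the balance the key finding reports: removing harmful knowledge while retaining benign and in-domain capability.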

🛡️ Threat Analysis


Details

Domains
NLP
Model Types
LLM, Transformer
Threat Tags
white_box, inference_time, training_time
Datasets
WMDP, MMLU
Applications
LLM safety alignment, dangerous capabilities removal, hazardous knowledge unlearning