Selective Fine-Tuning for Targeted and Robust Concept Unlearning
Mansi, Avinash Kori, Francesca Toni, Soteris Demetriou
Published on arXiv
2602.07919
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
TRUST achieves robust unlearning of individual, combined, and conditional concepts against adversarial prompts while preserving generation quality and requiring over 8x less wall-clock time than SOTA methods like SalUn.
TRUST
Novel technique introduced
Text-guided diffusion models are used by millions of users but can easily be exploited to produce harmful content. Concept unlearning methods aim to reduce a model's likelihood of generating harmful content. Traditionally, this has been tackled at the level of individual concepts, with only a handful of recent works considering more realistic concept combinations. However, state-of-the-art methods depend on full fine-tuning, which is computationally expensive. Concept localization methods can enable selective fine-tuning, but existing techniques are static, resulting in suboptimal utility. To tackle these challenges, we propose TRUST (Targeted Robust Selective fine-Tuning), a novel approach that dynamically estimates target concept neurons and unlearns them through selective fine-tuning, empowered by Hessian-based regularization. We show experimentally, against a number of SOTA baselines, that TRUST is robust to adversarial prompts, preserves generation quality to a significant degree, and is significantly faster than the SOTA. Our method achieves unlearning of not only individual concepts but also combinations of concepts and conditional concepts, without any specific regularization.
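The core mechanism described above, dynamically selecting a small set of target-concept neurons and fine-tuning only those, can be sketched roughly as below. This is an illustrative approximation only: the `saliency_mask` function, the top-fraction gradient-magnitude criterion, and the toy quadratic loop are assumptions for exposition, not TRUST's exact localization procedure.

```python
import numpy as np

def saliency_mask(grad: np.ndarray, frac: float = 0.01) -> np.ndarray:
    """Boolean mask over the top `frac` fraction of weights, ranked by
    gradient magnitude with respect to the unlearning loss."""
    k = max(1, int(frac * grad.size))
    thresh = np.partition(np.abs(grad).ravel(), -k)[-k]
    return np.abs(grad) >= thresh

# Toy selective fine-tuning loop on a quadratic surrogate loss.
# Dynamic localization: the mask is recomputed at every step from the
# current gradients, instead of being fixed once before training
# (the "static" strategy the paper argues becomes outdated early on).
rng = np.random.default_rng(0)
w = rng.standard_normal(1_000)
target = np.zeros_like(w)          # stand-in for "concept removed" weights
for _ in range(100):
    grad = w - target              # gradient of 0.5 * ||w - target||^2
    mask = saliency_mask(grad, frac=0.05)
    w -= 0.5 * grad * mask         # update only the selected neurons
```

The key contrast with static localization is the per-step recomputation of `mask`: as the selected weights change, the set of most salient neurons changes with them.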
Key Contributions
- Dynamic neuron localization that continuously updates which neurons to fine-tune during unlearning, replacing static saliency determination that becomes outdated early in training
- Hessian-based regularization to preserve generation quality for non-targeted concepts during selective fine-tuning
- Handles concept combination erasure (CCE) — unlearning harmful combinations of individually benign concepts — without additional regularization, while remaining robust against adversarial prompts
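As a rough illustration of how a Hessian-based regularizer can preserve non-targeted concepts, the sketch below uses a diagonal curvature approximation to penalize drift of high-curvature weights, in the style of an EWC-like quadratic penalty. The function name, the diagonal approximation, and the `lam` coefficient are assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def hessian_penalty(w: np.ndarray, w0: np.ndarray,
                    h_diag: np.ndarray, lam: float = 1.0) -> float:
    """Quadratic drift penalty weighted by approximate curvature:
    (lam / 2) * sum_i h_ii * (w_i - w0_i)^2.
    Weights with large curvature h_ii, i.e. those the retained concepts
    depend on most, are kept close to their pre-unlearning values w0."""
    return float(0.5 * lam * np.sum(h_diag * (w - w0) ** 2))
```

During selective fine-tuning, a term like this would be added to the unlearning loss, so the optimizer trades off erasing the target concept against moving weights that matter for everything else.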