ROKA: Robust Knowledge Unlearning against Adversaries
Jinmyeong Shin 1, Joshua Tapia 1, Nicholas Ferreira 2, Gabriel Diaz 1, Moayed Daneshyari 2, Hyeran Jeon 1
Published on arXiv
2603.00436
Model Skewing
OWASP ML Top 10 — ML08
Model Poisoning
OWASP ML Top 10 — ML10
Key Finding
ROKA preserves or enhances retained data accuracy while successfully unlearning targets across vision transformers, CLIP, and Llama, neutralizing indirect unlearning attacks that conventional methods leave exploitable.
ROKA (Neural Healing)
Novel technique introduced
Machine unlearning is critical for data privacy, yet existing methods often cause knowledge contamination by unintentionally damaging related knowledge. This degraded post-unlearning performance has recently been leveraged for new inference and backdoor attacks, but most such attacks rely on adversarial unlearning requests that require poisoning or duplicating training data. In this study, we introduce a new unlearning-induced attack model, the indirect unlearning attack, which requires no data manipulation: it exploits knowledge contamination itself to degrade model accuracy on security-critical predictions. To mitigate this attack, we introduce a theoretical framework that models neural networks as Neural Knowledge Systems. Building on it, we propose ROKA, a robust unlearning strategy centered on Neural Healing. Unlike conventional unlearning methods, which only destroy information, ROKA constructively rebalances the model by nullifying the influence of forgotten data while strengthening its conceptual neighbors. To the best of our knowledge, this work is the first to provide a theoretical guarantee of knowledge preservation during unlearning. Evaluations on various large models, including vision transformers, multi-modal models, and large language models, show that ROKA effectively unlearns targets while preserving, or even enhancing, accuracy on retained data, thereby mitigating indirect unlearning attacks.
Key Contributions
- Novel 'indirect unlearning attack': adversary submits a legitimate unlearning request for unrelated data, causing knowledge contamination that degrades security-critical predictions (e.g., face recognition bypass) without any data poisoning or duplication
- ROKA (Neural Healing): robust unlearning strategy that nullifies forgotten data influence while simultaneously strengthening conceptual neighbor knowledge, preventing knowledge contamination
- First theoretical guarantee for knowledge preservation during unlearning, via a Neural Knowledge Systems framework including a Sibling Knowledge Preservation theorem
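The Neural Healing idea described above pairs two opposing updates: gradient ascent on the forgotten samples (nullifying their influence) and gradient descent on their conceptual neighbors (strengthening retained knowledge). A minimal sketch on a toy linear softmax model, assuming a simple weighted two-term update; the function names and the `alpha`/`beta` weighting are illustrative assumptions, not the paper's implementation:

```python
# Illustrative sketch of a "Neural Healing"-style unlearning step on a toy
# linear softmax classifier. The alpha/beta weighting scheme is an assumption
# for illustration, not ROKA's actual objective.
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ce_loss(W, X, y):
    """Numerically stable mean cross-entropy of a linear softmax model."""
    z = X @ W
    m = z.max(axis=1)
    lse = m + np.log(np.exp(z - m[:, None]).sum(axis=1))
    return (lse - z[np.arange(len(y)), y]).mean()

def ce_grad(W, X, y):
    """Gradient of ce_loss with respect to the weight matrix W (d x C)."""
    p = softmax(X @ W)
    p[np.arange(len(y)), y] -= 1.0
    return X.T @ p / len(y)

def healing_step(W, X_forget, y_forget, X_neigh, y_neigh,
                 lr=0.05, alpha=1.0, beta=1.0):
    """One update: ascend the loss on forgotten samples while descending it
    on neighboring retained samples, rather than destroying alone."""
    g_forget = ce_grad(W, X_forget, y_forget)
    g_neigh = ce_grad(W, X_neigh, y_neigh)
    return W + lr * (alpha * g_forget - beta * g_neigh)
```

Conventional ascent-only unlearning corresponds to `beta=0`; the neighbor term is what distinguishes a constructive rebalance from pure destruction of information.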
🛡️ Threat Analysis
The 'indirect unlearning attack' exploits the unlearning feedback mechanism — an adversary submits a legitimate unlearning request that causes knowledge contamination, selectively degrading model accuracy on security-critical predictions (e.g., bypassing face recognition) without any direct data poisoning. This is model skewing via manipulation of the model update/feedback process.
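The contamination mechanism can be illustrated with a fully synthetic toy: two classes deliberately share a feature direction, and naive ascent-only unlearning of one can perturb the other. Everything below (the correlated-class setup, the ascent-only unlearning rule, the "security-critical" role assigned to class 1) is an illustrative assumption, not the paper's experimental setup:

```python
# Toy illustration of knowledge contamination from naive, ascent-only
# unlearning. Classes 0 and 1 share a feature direction; class 0 plays the
# adversary's unlearning target, class 1 a "security-critical" neighbor.
# Entirely synthetic; not the paper's experiment.
import numpy as np

rng = np.random.default_rng(42)
d, n = 6, 200

shared = rng.normal(size=d)  # feature direction shared by classes 0 and 1
means = np.stack([shared + rng.normal(scale=0.5, size=d),
                  shared + rng.normal(scale=0.5, size=d),
                  2.0 * rng.normal(size=d)])
X = np.vstack([0.5 * rng.normal(size=(n, d)) + means[c] for c in range(3)])
y = np.repeat(np.arange(3), n)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ce_loss(W, X, y):
    z = X @ W
    m = z.max(axis=1)
    lse = m + np.log(np.exp(z - m[:, None]).sum(axis=1))
    return (lse - z[np.arange(len(y)), y]).mean()

def ce_grad(W, X, y):
    p = softmax(X @ W)
    p[np.arange(len(y)), y] -= 1.0
    return X.T @ p / len(y)

def acc(W, X, y):
    return float((softmax(X @ W).argmax(axis=1) == y).mean())

# Train a linear softmax classifier on all three classes.
W = np.zeros((d, 3))
for _ in range(300):
    W -= 0.1 * ce_grad(W, X, y)

# "Legitimate" unlearning request: naive gradient ascent on class 0 only.
W_bad = W.copy()
for _ in range(100):
    W_bad += 0.1 * ce_grad(W_bad, X[y == 0], y[y == 0])

# The forget-class loss rises by construction; the correlated neighbor's
# accuracy can be dragged along with it, which is what the indirect attack
# exploits.
print("neighbor acc before/after:",
      acc(W, X[y == 1], y[y == 1]), acc(W_bad, X[y == 1], y[y == 1]))
```

ROKA's neighbor-strengthening term is aimed precisely at this failure mode: the declared conceptual neighbors are actively reinforced during the forget update instead of being left to drift.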
ROKA explicitly defends against backdoor attacks triggered by unlearning — a known threat where adversarial unlearning requests activate hidden malicious behavior in the model. The paper's Neural Healing approach is evaluated against this backdoor-via-unlearning threat in addition to the indirect attack.