Defense · 2026

ROKA: Robust Knowledge Unlearning against Adversaries

Jinmyeong Shin 1, Joshua Tapia 1, Nicholas Ferreira 2, Gabriel Diaz 1, Moayed Daneshyari 2, Hyeran Jeon 1



Published on arXiv: 2603.00436

Model Skewing (OWASP ML Top 10 — ML08)

Model Poisoning (OWASP ML Top 10 — ML10)

Key Finding

ROKA preserves or enhances retained data accuracy while successfully unlearning targets across vision transformers, CLIP, and Llama, neutralizing indirect unlearning attacks that conventional methods leave exploitable.

ROKA (Neural Healing)

Novel technique introduced


Machine unlearning is critical for data privacy, yet existing methods often cause Knowledge Contamination by unintentionally damaging related knowledge. Such degraded model performance after unlearning has recently been leveraged for new inference and backdoor attacks, but most prior attack studies design adversarial unlearning requests that require poisoning or duplicating training data. In this study, we introduce a new unlearning-induced attack model, the indirect unlearning attack, which requires no data manipulation but instead exploits knowledge contamination to perturb model accuracy on security-critical predictions. To mitigate this attack, we introduce a theoretical framework that models neural networks as Neural Knowledge Systems. Based on this, we propose ROKA, a robust unlearning strategy centered on Neural Healing. Unlike conventional unlearning methods that only destroy information, ROKA constructively rebalances the model by nullifying the influence of forgotten data while strengthening its conceptual neighbors. To the best of our knowledge, our work is the first to provide a theoretical guarantee for knowledge preservation during unlearning. Evaluations on various large models, including vision transformers, multi-modal models, and large language models, show that ROKA effectively unlearns targets while preserving, or even enhancing, the accuracy of retained data, thereby mitigating indirect unlearning attacks.


Key Contributions

  • Novel 'indirect unlearning attack': adversary submits a legitimate unlearning request for unrelated data, causing knowledge contamination that degrades security-critical predictions (e.g., face recognition bypass) without any data poisoning or duplication
  • ROKA (Neural Healing): robust unlearning strategy that nullifies forgotten data influence while simultaneously strengthening conceptual neighbor knowledge, preventing knowledge contamination
  • First theoretical guarantee for knowledge preservation during unlearning, via a Neural Knowledge Systems framework including a Sibling Knowledge Preservation theorem
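The contrast between destructive unlearning and Neural Healing can be sketched on a toy "knowledge system": concepts carry strengths, and sibling concepts are linked. This is an invented simplification for intuition only — the concept names, sibling map, and `damage`/`boost` factors are all hypothetical and do not come from the paper.

```python
# Hypothetical sibling links between concepts (invented for illustration).
SIBLINGS = {"husky": ["wolf", "malamute"], "wolf": ["husky"], "malamute": ["husky"]}

def naive_unlearn(knowledge, target, damage=0.4):
    """Destructive unlearning: removing the target leaks collateral
    damage into sibling concepts (knowledge contamination)."""
    out = dict(knowledge)
    out[target] = 0.0
    for sib in SIBLINGS.get(target, []):
        out[sib] = max(0.0, out[sib] - damage)  # unintended degradation
    return out

def healing_unlearn(knowledge, target, boost=0.1):
    """ROKA-style sketch: nullify only the forget target while
    strengthening its conceptual neighbors."""
    out = dict(knowledge)
    out[target] = 0.0
    for sib in SIBLINGS.get(target, []):
        out[sib] = min(1.0, out[sib] + boost)  # constructive rebalancing
    return out

model = {"husky": 0.9, "wolf": 0.8, "malamute": 0.85}
print(naive_unlearn(model, "husky"))    # siblings degraded
print(healing_unlearn(model, "husky"))  # siblings preserved or stronger
```

Both routines fully forget the target; they differ only in what happens to the target's conceptual neighbors, which is exactly the gap the Sibling Knowledge Preservation theorem formalizes.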

🛡️ Threat Analysis

Model Skewing

The 'indirect unlearning attack' exploits the unlearning feedback mechanism — an adversary submits a legitimate unlearning request that causes knowledge contamination, selectively degrading model accuracy on security-critical predictions (e.g., bypassing face recognition) without any direct data poisoning. This is model skewing via manipulation of the model update/feedback process.
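The attack flow can be illustrated with a toy face-recognition gallery: the adversary files a legitimate request to forget a benign identity, and the resulting contamination drops a related security-critical identity below the acceptance threshold. All names, scores, the threshold, and the contamination factor below are invented for illustration; the paper does not specify these numbers.

```python
THRESHOLD = 0.7  # face-recognition acceptance threshold (assumed)

def contaminated_unlearn(scores, target, related, contamination=0.3):
    """Naive unlearning of `target` whose collateral damage also
    degrades recognition scores of related identities."""
    out = dict(scores)
    out.pop(target)                      # the requested, legitimate forget
    for r in related:
        out[r] = out[r] - contamination  # knowledge contamination
    return out

gallery = {"benign_user": 0.95, "security_admin": 0.88}
after = contaminated_unlearn(gallery, "benign_user", ["security_admin"])
bypassed = after["security_admin"] < THRESHOLD  # recognition bypass
print(after["security_admin"], bypassed)
```

Note that the adversary never touches the training data: the only input is an unlearning request that the system must honor, which is what distinguishes this from conventional poisoning-based unlearning attacks.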

Model Poisoning

ROKA explicitly defends against backdoor attacks triggered by unlearning — a known threat where adversarial unlearning requests activate hidden malicious behavior in the model. The paper's Neural Healing approach is evaluated against this backdoor-via-unlearning threat in addition to the indirect attack.


Details

Domains: vision, nlp, multimodal
Model Types: transformer, llm, vlm
Threat Tags: black_box, training_time, targeted
Applications: face recognition, image classification, large language model fine-tuning