ROKA: Robust Knowledge Unlearning against Adversaries
Jinmyeong Shin 1, Joshua Tapia 1, Nicholas Ferreira 2, Gabriel Diaz 1, Moayed Daneshyari 2, Hyeran Jeon 1
Published on arXiv
2603.00436
Model Skewing
OWASP ML Top 10 — ML08
Model Poisoning
OWASP ML Top 10 — ML10
Key Finding
ROKA preserves or enhances retained data accuracy while successfully unlearning targets across vision transformers, CLIP, and Llama, neutralizing indirect unlearning attacks that conventional methods leave exploitable.
ROKA (Neural Healing)
Novel technique introduced
Machine unlearning is critical for data privacy, yet existing methods often cause knowledge contamination by unintentionally damaging related knowledge. This degraded post-unlearning performance has recently been leveraged for new inference and backdoor attacks, but most such attacks rely on adversarial unlearning requests that require poisoning or duplicating training data. In this study, we introduce a new unlearning-induced attack model, the indirect unlearning attack, which requires no data manipulation: it exploits knowledge contamination itself to degrade model accuracy on security-critical predictions. To mitigate this attack, we introduce a theoretical framework that models neural networks as Neural Knowledge Systems. Building on it, we propose ROKA, a robust unlearning strategy centered on Neural Healing. Unlike conventional unlearning methods, which only destroy information, ROKA constructively rebalances the model by nullifying the influence of forgotten data while strengthening its conceptual neighbors. To the best of our knowledge, this work is the first to provide a theoretical guarantee of knowledge preservation during unlearning. Evaluations on various large models, including vision transformers, multi-modal models, and large language models, show that ROKA effectively unlearns targets while preserving, or even enhancing, accuracy on retained data, thereby mitigating indirect unlearning attacks.
Key Contributions
- Novel 'indirect unlearning attack': adversary submits a legitimate unlearning request for unrelated data, causing knowledge contamination that degrades security-critical predictions (e.g., face recognition bypass) without any data poisoning or duplication
- ROKA (Neural Healing): robust unlearning strategy that nullifies forgotten data influence while simultaneously strengthening conceptual neighbor knowledge, preventing knowledge contamination
- First theoretical guarantee for knowledge preservation during unlearning, via a Neural Knowledge Systems framework including a Sibling Knowledge Preservation theorem
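The Neural Healing idea described above pairs two opposing updates: gradient ascent on the forgotten samples (nullifying their influence) and gradient descent on their conceptual neighbors (strengthening retained knowledge). A minimal sketch on a toy linear softmax model, assuming a simple weighted two-term update; the function names and the `alpha`/`beta` weighting are illustrative assumptions, not the paper's implementation:

```python
# Illustrative sketch of a "Neural Healing"-style unlearning step on a toy
# linear softmax classifier. The alpha/beta weighting scheme is an assumption
# for illustration, not ROKA's actual objective.
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ce_loss(W, X, y):
    """Numerically stable mean cross-entropy of a linear softmax model."""
    z = X @ W
    m = z.max(axis=1)
    lse = m + np.log(np.exp(z - m[:, None]).sum(axis=1))
    return (lse - z[np.arange(len(y)), y]).mean()

def ce_grad(W, X, y):
    """Gradient of ce_loss with respect to the weight matrix W (d x C)."""
    p = softmax(X @ W)
    p[np.arange(len(y)), y] -= 1.0
    return X.T @ p / len(y)

def healing_step(W, X_forget, y_forget, X_neigh, y_neigh,
                 lr=0.05, alpha=1.0, beta=1.0):
    """One update: ascend the loss on forgotten samples while descending it
    on neighboring retained samples, rather than destroying alone."""
    g_forget = ce_grad(W, X_forget, y_forget)
    g_neigh = ce_grad(W, X_neigh, y_neigh)
    return W + lr * (alpha * g_forget - beta * g_neigh)
```

Conventional ascent-only unlearning corresponds to `beta=0`; the neighbor term is what distinguishes a constructive rebalance from pure destruction of information.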
🛡️ Threat Analysis
The 'indirect unlearning attack' exploits the unlearning feedback mechanism — an adversary submits a legitimate unlearning request that causes knowledge contamination, selectively degrading model accuracy on security-critical predictions (e.g., bypassing face recognition) without any direct data poisoning. This is model skewing via manipulation of the model update/feedback process.
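The contamination mechanism can be illustrated with a fully synthetic toy: two classes deliberately share a feature direction, and naive ascent-only unlearning of one can perturb the other. Everything below (the correlated-class setup, the ascent-only unlearning rule, the "security-critical" role assigned to class 1) is an illustrative assumption, not the paper's experimental setup:

```python
# Toy illustration of knowledge contamination from naive, ascent-only
# unlearning. Classes 0 and 1 share a feature direction; class 0 plays the
# adversary's unlearning target, class 1 a "security-critical" neighbor.
# Entirely synthetic; not the paper's experiment.
import numpy as np

rng = np.random.default_rng(42)
d, n = 6, 200

shared = rng.normal(size=d)  # feature direction shared by classes 0 and 1
means = np.stack([shared + rng.normal(scale=0.5, size=d),
                  shared + rng.normal(scale=0.5, size=d),
                  2.0 * rng.normal(size=d)])
X = np.vstack([0.5 * rng.normal(size=(n, d)) + means[c] for c in range(3)])
y = np.repeat(np.arange(3), n)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ce_loss(W, X, y):
    z = X @ W
    m = z.max(axis=1)
    lse = m + np.log(np.exp(z - m[:, None]).sum(axis=1))
    return (lse - z[np.arange(len(y)), y]).mean()

def ce_grad(W, X, y):
    p = softmax(X @ W)
    p[np.arange(len(y)), y] -= 1.0
    return X.T @ p / len(y)

def acc(W, X, y):
    return float((softmax(X @ W).argmax(axis=1) == y).mean())

# Train a linear softmax classifier on all three classes.
W = np.zeros((d, 3))
for _ in range(300):
    W -= 0.1 * ce_grad(W, X, y)

# "Legitimate" unlearning request: naive gradient ascent on class 0 only.
W_bad = W.copy()
for _ in range(100):
    W_bad += 0.1 * ce_grad(W_bad, X[y == 0], y[y == 0])

# The forget-class loss rises by construction; the correlated neighbor's
# accuracy can be dragged along with it, which is what the indirect attack
# exploits.
print("neighbor acc before/after:",
      acc(W, X[y == 1], y[y == 1]), acc(W_bad, X[y == 1], y[y == 1]))
```

ROKA's neighbor-strengthening term is aimed precisely at this failure mode: the declared conceptual neighbors are actively reinforced during the forget update instead of being left to drift.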
ROKA explicitly defends against backdoor attacks triggered by unlearning — a known threat where adversarial unlearning requests activate hidden malicious behavior in the model. The paper's Neural Healing approach is evaluated against this backdoor-via-unlearning threat in addition to the indirect attack.