defense arXiv Apr 16, 2026
Yisheng Zhong, Sijia Liu, Zhuangdi Zhu · George Mason University · Michigan State University
Multi-objective LLM unlearning framework that removes hazardous knowledge while defending against adversarial probing attacks via bidirectional distillation
Model Inversion Attack Prompt Injection Sensitive Information Disclosure nlp
Large language model (LLM) unlearning is crucial for removing hazardous or privacy-leaking information from a model. Practical LLM unlearning demands satisfying multiple challenging objectives simultaneously: removing undesirable knowledge, preserving general utility, avoiding over-refusal of neighboring concepts, and, crucially, ensuring robustness against adversarial probing attacks. However, existing unlearning methods primarily focus on a limited subset of these goals, typically unlearning efficacy and utility preservation, while overlooking robustness and boundary behaviors. Naively extending these methods to multi-objective settings may lead to interference between unlearning tasks. We propose a novel multi-objective unlearning framework that harmonizes multiple unlearning objectives through a data and optimization co-design: we standardize training corpora into a unified data representation to reduce the domain gap, and then introduce a bidirectional distillation method that simultaneously elicits desired behavior from a context-instructed teacher while suppressing undesirable behavior in the student model. Theoretical and empirical analyses show that our method aligns domain distributions and converts seemingly irrelevant unlearning tasks into cooperative optimization. Evaluation demonstrates state-of-the-art performance, which enables balanced and reliable unlearning across diverse, challenging requirements.
llm transformer George Mason University · Michigan State University
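The abstract above describes bidirectional distillation only at a high level, so the following is a minimal sketch of one possible reading, not the authors' loss. It assumes a "pull" term (KL divergence from a teacher distribution to the student) and a capped "push" term (divergence of the student from an undesired distribution). The function names, the `beta` weight, and the `cap` value are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def bidirectional_loss(student_logits, teacher_logits, undesired_logits,
                       beta=0.5, cap=5.0):
    """Hypothetical two-sided objective (an assumption, not the paper's formula).

    pull: move the student toward the teacher's desired distribution.
    push: move the student away from the undesired distribution; the term is
          capped so that minimizing the loss cannot reward unbounded divergence.
    """
    student = softmax(student_logits)
    teacher = softmax(teacher_logits)
    undesired = softmax(undesired_logits)
    pull = kl_divergence(teacher, student)
    push = min(kl_divergence(undesired, student), cap)
    return pull - beta * push


# Toy three-token vocabulary: the teacher prefers token 0, the undesired
# behavior prefers token 2.
teacher_logits = [2.0, 0.0, 0.0]
undesired_logits = [0.0, 0.0, 2.0]
loss_aligned = bidirectional_loss(teacher_logits, teacher_logits, undesired_logits)
loss_leaky = bidirectional_loss(undesired_logits, teacher_logits, undesired_logits)
```

Under these assumptions, a student that matches the teacher gets a lower loss than a student that reproduces the undesired distribution, which is the qualitative behavior the abstract describes.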
benchmark arXiv Apr 23, 2026
Xiaoyi Chen, Haoyuan Wang, Siyuan Tang et al. · Indiana University Bloomington · Independent Researcher +3 more
Evaluation framework exposing weaknesses in LLM privacy unlearning through three-tier attacks: direct retrieval, in-context recovery, and fine-tuning restoration
Model Inversion Attack Sensitive Information Disclosure nlp
Large language models (LLMs) often memorize private information during training, raising serious privacy concerns. While machine unlearning has emerged as a promising solution, its true effectiveness against privacy attacks remains unclear. To address this, we propose PrivUn, a new evaluation framework that systematically assesses unlearning robustness through three-tier attack scenarios (direct retrieval, in-context learning recovery, and fine-tuning restoration), combined with quantitative analysis using forgetting scores, association metrics, and forgetting depth assessment. Our study exposes significant weaknesses in current unlearning methods, revealing two key findings: 1) unlearning exhibits gradient-driven ripple effects: unlike traditional forgetting, which follows semantic relations (e.g., knowledge graphs), privacy unlearning propagates across latent gradient-based associations; and 2) most methods suffer from shallow forgetting, failing to remove private information distributed across multiple deep model layers. To validate these insights, we explore two strategies: association-aware core-set selection that leverages gradient similarity, and multi-layer deep intervention through representational constraints. These strategies represent a paradigm shift from shallow forgetting to deep forgetting.
llm transformer Indiana University Bloomington · Independent Researcher · Michigan State University +2 more
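The PrivUn abstract mentions association-aware core-set selection that leverages gradient similarity, without giving the selection rule. As a rough illustration only, the sketch below assumes that each candidate record has a precomputed gradient vector and that candidates are ranked by cosine similarity to the gradient of the record being forgotten. `select_core_set`, the toy vectors, and the top-`k` rule are hypothetical and not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_core_set(target_grad, candidate_grads, k):
    """Return the names of the k candidates whose gradients best align with the target.

    target_grad:     gradient vector of the record to forget (assumed precomputed).
    candidate_grads: dict mapping a record name to its gradient vector.
    """
    scored = [
        (cosine_similarity(target_grad, grad), name)
        for name, grad in candidate_grads.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


# Toy two-dimensional "gradients"; real per-example gradients would be far larger.
target = [1.0, 0.0]
candidates = {
    "a": [0.9, 0.1],   # nearly parallel to the target
    "b": [0.0, 1.0],   # orthogonal
    "c": [-1.0, 0.0],  # opposite direction
    "d": [0.5, 0.5],   # partially aligned
}
core_set = select_core_set(target, candidates, k=2)
```

In this toy setup the selected records are the two whose gradients point most nearly in the target's direction, regardless of any semantic relation between the records. That mirrors the abstract's claim that privacy unlearning propagates along gradient-based rather than semantic associations.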