
Downgrade to Upgrade: Optimizer Simplification Enhances Robustness in LLM Unlearning

Yicheng Lang 1, Yihua Zhang 1, Chongyu Fan 1, Changsheng Wang 1, Jinghan Jia 1, Sijia Liu 1,2

0 citations · 81 references

Published on arXiv: 2510.00761

Prompt Injection

OWASP LLM Top 10 — LLM01

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

Downgrading optimizers from first-order to zeroth-order or gradient-sign variants improves resilience of LLM unlearning against relearning attacks and weight quantization without sacrificing unlearning quality.
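The "gradient-sign variants" referenced here correspond to signSGD-style updates, which discard gradient magnitudes and keep only per-coordinate signs. A minimal sketch of one such "downgraded" step (parameter names are illustrative, not from the paper):

```python
import numpy as np

def signsgd_step(theta, grad, lr=1e-3):
    """One signSGD update: keep only the sign of each gradient coordinate.

    This is a 'compressed-gradient' downgrade of plain SGD: magnitude
    information is discarded, which the paper links to convergence toward
    harder-to-disturb solutions.
    """
    return theta - lr * np.sign(grad)

# Example: every coordinate moves by exactly lr, regardless of gradient scale.
theta = np.array([1.0, -2.0])
grad = np.array([0.5, -3.0])
updated = signsgd_step(theta, grad, lr=0.1)  # -> [0.9, -1.9]
```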

FO-ZO Hybrid Optimizer

Novel technique introduced


Abstract

Large language model (LLM) unlearning aims to surgically remove the influence of undesired data or knowledge from an existing model while preserving its utility on unrelated tasks. This paradigm has shown promise in addressing privacy and safety concerns. However, recent findings reveal that unlearning effects are often fragile: post-unlearning manipulations such as weight quantization or fine-tuning can quickly neutralize the intended forgetting. Prior efforts to improve robustness have primarily reformulated unlearning objectives around explicit assumptions about the source of vulnerability. In this work, we take a different perspective by investigating the role of the optimizer, independent of unlearning objectives and formulations, in shaping unlearning robustness. We show that the 'grade' of the optimizer, defined by the level of information it exploits, ranging from zeroth-order (gradient-free) to first-order (gradient-based) to second-order (Hessian-based), is tightly linked to the resilience of unlearning. Surprisingly, we find that downgrading the optimizer, such as using zeroth-order methods or compressed-gradient variants (e.g., gradient sign-based optimizers), often leads to stronger robustness. While these optimizers produce noisier and less precise updates, they encourage convergence to harder-to-disturb basins in the loss landscape, thereby resisting post-training perturbations. By connecting zeroth-order methods with randomized smoothing, we further highlight their natural advantage for robust unlearning. Motivated by these insights, we propose a hybrid optimizer that combines first-order and zeroth-order updates, preserving unlearning efficacy while enhancing robustness. Extensive experiments on the MUSE and WMDP benchmarks, across multiple LLM unlearning algorithms, validate that our approach achieves more resilient forgetting without sacrificing unlearning quality.
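The abstract's connection between zeroth-order methods and randomized smoothing can be illustrated with the standard two-point Gaussian estimator: averaging finite differences along random directions yields, in expectation, the gradient of the loss convolved with a Gaussian. A generic sketch (not the paper's implementation; `mu` and `n_samples` are illustrative hyperparameters):

```python
import numpy as np

def zo_gradient(loss_fn, theta, mu=1e-3, n_samples=100, rng=None):
    """Two-point zeroth-order gradient estimate.

    Each probe perturbs theta along a random Gaussian direction u and takes
    a symmetric finite difference. The average is an unbiased estimate of the
    gradient of the Gaussian-smoothed loss E_u[loss(theta + mu * u)], which is
    the randomized-smoothing view of zeroth-order optimization.
    """
    rng = np.random.default_rng(rng)
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        u = rng.standard_normal(theta.shape)
        delta = (loss_fn(theta + mu * u) - loss_fn(theta - mu * u)) / (2 * mu)
        grad += delta * u
    return grad / n_samples
```

Because the estimate targets the smoothed loss rather than the raw one, its updates are biased toward flat regions, which is one intuition for why ZO unlearning resists post-training perturbations.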


Key Contributions

  • First systematic study showing that lower-order optimizers (zeroth-order, gradient-sign) improve LLM unlearning robustness by converging to flatter, harder-to-disturb loss landscape basins resistant to post-training perturbations.
  • Theoretical connection between zeroth-order optimization and randomized smoothing, explaining ZO methods' natural advantage for robust unlearning.
  • Hybrid FO-ZO optimizer that combines the unlearning efficacy of first-order updates with the tamper-resistance of zeroth-order methods, validated across multiple unlearning algorithms on MUSE and WMDP benchmarks.
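The FO-ZO hybrid above can be sketched as a convex combination of an exact gradient and a zeroth-order estimate. This is an illustrative update rule only; the paper's exact blending scheme is not given on this page, and `alpha`, `mu`, and `n_samples` are assumed names:

```python
import numpy as np

def hybrid_fo_zo_step(theta, grad_fn, loss_fn, lr=1e-2, alpha=0.5,
                      mu=1e-3, n_samples=16, rng=None):
    """One hybrid FO-ZO update (illustrative sketch).

    Blends the exact first-order gradient with a two-point zeroth-order
    estimate: alpha=1 recovers plain gradient descent, alpha=0 is pure ZO.
    """
    rng = np.random.default_rng(rng)
    # Zeroth-order part: average finite differences along random directions.
    g_zo = np.zeros_like(theta)
    for _ in range(n_samples):
        u = rng.standard_normal(theta.shape)
        g_zo += (loss_fn(theta + mu * u) - loss_fn(theta - mu * u)) / (2 * mu) * u
    g_zo /= n_samples
    # First-order part: the usual backprop gradient.
    g_fo = grad_fn(theta)
    return theta - lr * (alpha * g_fo + (1 - alpha) * g_zo)
```

The design intent, per the contribution above, is that the FO term preserves unlearning efficacy while the ZO term steers updates toward flatter, tamper-resistant basins.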

🛡️ Threat Analysis


Details

Domains: nlp
Model Types: llm, transformer
Threat Tags: white_box, training_time
Datasets: MUSE, WMDP
Applications: LLM unlearning, language model safety, privacy-preserving language models