
LLM Unlearning Under the Microscope: A Full-Stack View on Methods and Metrics

Chongyu Fan 1,2, Changsheng Wang 1,2, Yancheng Huang 1,2, Soumyadeep Pal 1,2, Sijia Liu 1,2

0 citations · 69 references · arXiv


Published on arXiv: 2510.07626

Prompt Injection (OWASP LLM Top 10: LLM01)

Key Finding

MCQ-based evaluations overstate unlearning success; Open-QA metrics reveal residual harmful generation capability and expose a fundamental unlearning effectiveness–utility tradeoff that differs across method families and attack types.

Novel technique introduced: Open-QA metrics


Machine unlearning for large language models (LLMs) aims to remove undesired data, knowledge, and behaviors (e.g., for safety, privacy, or copyright) while preserving useful model capabilities. Despite rapid progress over the past two years, research in LLM unlearning remains fragmented, with limited clarity on what constitutes effective unlearning and how it should be rigorously evaluated. In this work, we present a principled taxonomy of twelve recent stateful unlearning methods, grouped into three methodological families: divergence-driven optimization, representation misalignment, and rejection-based targeted unlearning. Building on this taxonomy, we revisit the evaluation of unlearning effectiveness (UE), utility retention (UT), and robustness (Rob), focusing on the WMDP benchmark. Our analysis shows that current evaluations, dominated by multiple-choice question (MCQ) accuracy, offer only a narrow perspective, often overstating success while overlooking the model's actual generation behavior. To address this gap, we introduce open question-answering (Open-QA) metrics that better capture generative performance and reveal the inherent UE-UT tradeoff across method families. Furthermore, we demonstrate that robustness requires finer-grained analysis: for example, vulnerabilities differ substantially between in-domain relearning and out-of-domain fine-tuning, even though both fall under model-level attacks. Through this study, we hope to deliver a full-stack revisit of LLM unlearning and actionable guidance for designing and evaluating future methods.


Key Contributions

  • Principled taxonomy of 12 LLM unlearning methods grouped into three families: divergence-driven optimization, representation misalignment, and rejection-based targeted unlearning
  • Open-QA evaluation metrics that capture free-form generative behavior post-unlearning, exposing how MCQ-based metrics overstate unlearning success on WMDP
  • Fine-grained robustness analysis distinguishing in-domain relearning vs. out-of-domain fine-tuning attacks and input-level attacks, revealing family-specific vulnerability profiles
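The gap between MCQ accuracy and generative leakage that the paper highlights can be illustrated with a minimal sketch. Everything below is hypothetical (mock model, keyword-overlap scoring) and is not the paper's actual evaluation harness; it only shows why a model can score near zero on multiple-choice probes while still emitting the targeted knowledge in free-form generation.

```python
# Hypothetical sketch (not the paper's harness): why MCQ accuracy can overstate
# unlearning. A mock "unlearned" model scrambles its option ranking (low MCQ
# accuracy) yet still reproduces the forbidden content when generating freely.

def mcq_accuracy(model, questions):
    """Fraction of MCQ items where the model's top-ranked option is correct."""
    correct = 0
    for q in questions:
        pred = model.rank_options(q["question"], q["options"])[0]
        correct += int(pred == q["answer"])
    return correct / len(questions)

def open_qa_leakage(model, questions, keywords):
    """Fraction of free-form answers still containing target knowledge.
    Keyword overlap is a stand-in for the paper's Open-QA metrics."""
    leaks = 0
    for q in questions:
        text = model.generate(q["question"]).lower()
        leaks += int(any(k.lower() in text for k in keywords))
    return leaks / len(questions)

class MockUnlearnedModel:
    """Toy model: MCQ behavior looks 'unlearned', generation does not."""
    def rank_options(self, question, options):
        # Always ranks a distractor first, so MCQ accuracy collapses.
        return list(reversed(options))
    def generate(self, question):
        # Residual knowledge still surfaces in open-ended generation.
        return "The synthesis route uses precursor X at 300C."

questions = [
    {"question": "How is agent X synthesized?",
     "options": ["precursor X route", "distractor A", "distractor B"],
     "answer": "precursor X route"},
]
model = MockUnlearnedModel()
print("MCQ accuracy:", mcq_accuracy(model, questions))                          # 0.0
print("Open-QA leakage:", open_qa_leakage(model, questions, ["precursor X"]))   # 1.0
```

Under these toy assumptions, an MCQ-only evaluation would declare unlearning successful (accuracy 0.0) even though every free-form answer still leaks the target knowledge, which is the kind of blind spot the Open-QA metrics are designed to expose.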



Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, training_time
Datasets
WMDP
Applications
llm safety, harmful content removal, knowledge unlearning