
Understanding the Dilemma of Unlearning for Large Language Models

Qingjie Zhang 1, Haoting Qian 1, Zhicong Huang 2, Cheng Hong 2, Minlie Huang 1, Ke Xu 1, Chao Zhang 1, Han Qiu 1

3 citations · 65 references

Published on arXiv: 2509.24675

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

Across six unlearning methods and three LLMs, supposedly forgotten knowledge can be recovered without any weight modification by simply emphasizing relevant keywords in prompts, revealing that current unlearning methods achieve only superficial suppression rather than true erasure.
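The recovery attack described above can be sketched as a simple prompt rewrite. The emphasis template below is an illustrative assumption, not the paper's exact wording; the finding is only that stressing the relevant keywords in the prompt, with no weight changes, can surface supposedly unlearned answers.

```python
def emphasize_keywords(prompt: str, keywords: list[str]) -> str:
    """Rewrite a prompt so the given keywords are explicitly stressed.

    Illustrative template (an assumption): wrap each keyword in markers
    and append an instruction drawing attention back to the keywords.
    """
    stressed = prompt
    for kw in keywords:
        stressed = stressed.replace(kw, f"**{kw}**")
    return stressed + "\nPay special attention to: " + ", ".join(keywords)

out = emphasize_keywords("Where does Harry Potter study?", ["Harry Potter"])
print(out)
```

Feeding such an emphasized prompt to an "unlearned" model is the no-weight-modification probe the paper uses to test whether knowledge was truly erased.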

unPact

Novel technique introduced


Unlearning seeks to remove specific knowledge from large language models (LLMs), but its effectiveness remains contested. On one side, "forgotten" knowledge can often be recovered through interventions such as light fine-tuning; on the other side, unlearning may induce catastrophic forgetting that degrades general capabilities. Despite active exploration of unlearning methods, interpretability analyses of the mechanism are scarce due to the difficulty of tracing knowledge in LLMs' complex architectures. We address this gap by proposing unPact, an interpretable framework for unlearning via prompt attribution and contribution tracking. Concretely, it quantifies each prompt token's influence on outputs, enabling pre- and post-unlearning comparisons to reveal what changes. Across six mainstream unlearning methods, three LLMs, and three benchmarks, we find that: (1) Unlearning appears to be effective by disrupting focus on keywords in the prompt; (2) Much of the knowledge is not truly erased and can be recovered by simply emphasizing these keywords in prompts, without modifying the model's weights; (3) Catastrophic forgetting arises from indiscriminate penalization of all tokens. Taken together, our results suggest an unlearning dilemma: existing methods tend either to be insufficient (knowledge remains recoverable by keyword emphasis) or overly destructive (general performance collapses due to catastrophic forgetting), still leaving a gap to reliable unlearning.


Key Contributions

  • unPact: an interpretability framework that quantifies each prompt token's influence on LLM outputs, enabling pre- and post-unlearning comparisons to reveal mechanistic changes
  • Demonstration that supposedly unlearned knowledge can be recovered without weight modification, simply by emphasizing relevant keywords in prompts — exposing a fundamental insufficiency in six mainstream unlearning methods
  • Attribution-based diagnosis showing that catastrophic forgetting arises from indiscriminate penalization of all prompt tokens, including common words critical for general capability
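The token-influence idea behind unPact can be illustrated with a leave-one-out attribution sketch. This is a generic ablation scheme, not the paper's exact method: each prompt token is scored by how much its removal changes the model's score for the reference answer. `answer_logprob` is a hypothetical stand-in for a real LLM call (e.g., summing the answer's token log-probabilities conditioned on the prompt).

```python
from typing import Callable, Dict, List

def token_contributions(
    tokens: List[str],
    answer_logprob: Callable[[List[str]], float],
) -> Dict[str, float]:
    """Leave-one-out contribution of each prompt token to the answer score."""
    base = answer_logprob(tokens)
    contrib = {}
    for i, tok in enumerate(tokens):
        ablated = tokens[:i] + tokens[i + 1:]
        # Positive value: the token supports producing the answer.
        contrib[tok] = base - answer_logprob(ablated)
    return contrib

# Toy scorer (an assumption for the demo): rewards prompts that
# still contain the keyword "Hogwarts".
def toy_scorer(tokens: List[str]) -> float:
    return 0.0 if "Hogwarts" in tokens else -5.0

scores = token_contributions(["Where", "is", "Hogwarts", "located", "?"], toy_scorer)
print(max(scores, key=scores.get))  # → Hogwarts
```

Comparing such per-token scores before and after unlearning is what lets the authors observe that unlearning mainly suppresses attention to keywords rather than erasing the underlying knowledge.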

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, inference_time
Datasets
TOFU, WMDP, MUSE
Applications
LLM knowledge unlearning, hazardous knowledge removal, sensitive training data removal