The Unlearning Mirage: A Dynamic Framework for Evaluating LLM Unlearning
Raj Sanjay Shah¹, Jing Huang², Keerthiram Murugesan³, Nathalie Baracaldo³, Diyi Yang²
Published on arXiv (2603.11266)
Sensitive Information Disclosure
OWASP LLM Top 10 — LLM06
Key Finding
Multi-hop and alias-based probes recover forgotten information missed by all existing static benchmarks, with activation analysis showing unlearning disrupts dominant pathways while leaving alternative multi-hop pathways intact.
Unlearning Mirage Framework
Novel technique introduced
Unlearning in Large Language Models (LLMs) aims to enhance safety, mitigate biases, and comply with legal mandates such as the right to be forgotten. However, existing unlearning methods are brittle: minor query modifications, such as multi-hop reasoning and entity aliasing, can recover supposedly forgotten information. As a result, current evaluation metrics often create an illusion of effectiveness, failing to detect these vulnerabilities because they rely on static, unstructured benchmarks. We propose a dynamic framework that stress-tests unlearning robustness using complex structured queries. Our approach first elicits knowledge from the target model (pre-unlearning) and then constructs targeted probes, ranging from simple queries to multi-hop chains, allowing precise control over query difficulty. Our experiments show that the framework (1) achieves coverage comparable to existing benchmarks by automatically generating semantically equivalent Q&A probes, (2) aligns with prior evaluations, and (3) uncovers new unlearning failures missed by other benchmarks, particularly in multi-hop settings. Furthermore, activation analyses show that single-hop queries typically follow dominant computation pathways, which are more likely to be disrupted by unlearning methods. In contrast, multi-hop queries tend to use alternative pathways that often remain intact, explaining the brittleness of unlearning techniques in multi-hop settings. Our framework enables practical, scalable evaluation of unlearning methods without manual construction of forget test sets, easing adoption in real-world applications. We release the pip package and the code at https://sites.google.com/view/unlearningmirage/home.
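The probe-construction idea described above can be sketched in a few lines: treat the knowledge elicited from the pre-unlearning model as (subject, relation, object) triples, then compose direct, alias-substituted, and chained multi-hop questions from those edges. The entities, relations, and question templates below are illustrative assumptions, not the paper's actual data or pipeline.

```python
# Toy knowledge elicited from the model before unlearning, as triples.
# In the actual framework these would come from the target model itself.
triples = [
    ("Alice Author", "wrote", "Book X"),
    ("Book X", "is set in", "City Y"),
]

# Known aliases of the forget target, used for alias-based probes.
aliases = {"Alice Author": ["A. Author"]}

def single_hop_probes(triples):
    # Direct probes: ask for the object of each edge.
    return [f"{s} {r} what?" for s, r, o in triples]

def alias_probes(triples, aliases):
    # Same questions, but with the subject replaced by each alias.
    return [f"{a} {r} what?"
            for s, r, o in triples
            for a in aliases.get(s, [])]

def two_hop_probes(triples):
    # Chain two edges through a shared intermediate entity, so the
    # probe never names the intermediate ("Book X") directly.
    return [f"The work that {s1} {r1} {r2} what?"
            for s1, r1, o1 in triples
            for s2, r2, o2 in triples
            if o1 == s2]

probes = (single_hop_probes(triples)
          + alias_probes(triples, aliases)
          + two_hop_probes(triples))
for p in probes:
    print(p)
```

Composing probes from graph edges is what gives the framework its control over difficulty: single-hop and alias probes differ only in surface form, while multi-hop probes force the model to route through the intermediate entity.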
Key Contributions
- Dynamic evaluation framework that constructs knowledge graphs from pre-unlearning model outputs and generates structured probes (single-hop, multi-hop, alias-based) to stress-test unlearning robustness without manual dataset construction
- Empirical demonstration that existing LLM unlearning methods are brittle: multi-hop and alias queries recover supposedly forgotten information that single-hop evaluations miss (~78% coverage of RWKU Q&A pairs)
- Activation analysis via PatchScopes revealing that unlearning disrupts dominant computation pathways used by direct queries, while multi-hop queries route through alternative intact pathways — explaining systematic evaluation failures