
The Unlearning Mirage: A Dynamic Framework for Evaluating LLM Unlearning

Raj Sanjay Shah 1, Jing Huang 2, Keerthiram Murugesan 3, Nathalie Baracaldo 3, Diyi Yang 2


Published on arXiv (2603.11266)

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

Multi-hop and alias-based probes recover forgotten information missed by all existing static benchmarks, with activation analysis showing unlearning disrupts dominant pathways while leaving alternative multi-hop pathways intact.

Unlearning Mirage Framework

Novel technique introduced


Unlearning in Large Language Models (LLMs) aims to enhance safety, mitigate biases, and comply with legal mandates, such as the right to be forgotten. However, existing unlearning methods are brittle: minor query modifications, such as multi-hop reasoning and entity aliasing, can recover supposedly forgotten information. As a result, current evaluation metrics often create an illusion of effectiveness, failing to detect these vulnerabilities due to their reliance on static, unstructured benchmarks. We propose a dynamic framework that stress-tests unlearning robustness using complex structured queries. Our approach first elicits knowledge from the target model (pre-unlearning) and constructs targeted probes, ranging from simple queries to multi-hop chains, allowing precise control over query difficulty. Our experiments show that the framework (1) achieves comparable coverage to existing benchmarks by automatically generating semantically equivalent Q&A probes, (2) aligns with prior evaluations, and (3) uncovers new unlearning failures missed by other benchmarks, particularly in multi-hop settings. Furthermore, activation analyses show that single-hop queries typically follow dominant computation pathways, which are more likely to be disrupted by unlearning methods. In contrast, multi-hop queries tend to use alternative pathways that often remain intact, explaining the brittleness of unlearning techniques in multi-hop settings. Our framework enables practical and scalable evaluation of unlearning methods without manual construction of forget test sets, making it easier to adopt in real-world applications. We release the pip package and the code at https://sites.google.com/view/unlearningmirage/home.
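The probe-construction step described in the abstract (elicit triples from the pre-unlearning model, then generate single-hop, multi-hop, and alias-based probes of controlled difficulty) could be sketched roughly as follows. All entity names, templates, and function names here are illustrative assumptions, not the released package's API:

```python
# Illustrative sketch: building cloze probes from (subject, relation, object)
# triples elicited from the model before unlearning. Hypothetical names only.

TRIPLES = [
    ("J.K. Rowling", "wrote", "Harry Potter"),
    ("Harry Potter", "was published by", "Bloomsbury"),
]
ALIASES = {"J.K. Rowling": ["Robert Galbraith"]}  # alternative surface forms

def single_hop(subject, relation):
    """Direct probe for one graph edge; the answer is the edge's object."""
    return f"Complete the statement: {subject} {relation} ___."

def multi_hop(edge1, edge2):
    """Chain two edges through a shared entity without naming that entity."""
    (s1, r1, o1), (s2, r2, _) = edge1, edge2
    assert o1 == s2, "edges must chain through a shared entity"
    return f"Complete the statement: the work that {s1} {r1} {r2} ___."

def with_aliases(probe, aliases):
    """Rewrite a probe using each known alias of the entities it mentions."""
    return [probe.replace(name, alt)
            for name, alts in aliases.items()
            for alt in alts
            if name in probe]

direct = single_hop("J.K. Rowling", "wrote")
chained = multi_hop(TRIPLES[0], TRIPLES[1])
aliased = with_aliases(direct, ALIASES)
print(direct)   # Complete the statement: J.K. Rowling wrote ___.
print(chained)  # Complete the statement: the work that J.K. Rowling wrote was published by ___.
print(aliased)  # ['Complete the statement: Robert Galbraith wrote ___.']
```

The multi-hop probe asks for the far endpoint of a two-edge chain without ever naming the bridging entity, which is how query difficulty can be dialed up while keeping the ground-truth answer fixed.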


Key Contributions

  • Dynamic evaluation framework that constructs knowledge graphs from pre-unlearning model outputs and generates structured probes (single-hop, multi-hop, alias-based) to stress-test unlearning robustness without manual dataset construction (~78% coverage of RWKU Q&A pairs)
  • Empirical demonstration that existing LLM unlearning methods are brittle: multi-hop and alias queries recover supposedly forgotten information that single-hop evaluations miss
  • Activation analysis via PatchScopes revealing that unlearning disrupts the dominant computation pathways used by direct queries, while multi-hop queries route through alternative, intact pathways, explaining these systematic evaluation failures
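The brittleness described above can be made concrete with a toy recovery-rate check. Both `recovery_rate` and `toy_model` are stand-ins invented for illustration; the actual framework queries the target LLM and scores its responses:

```python
# Hedged sketch: per-probe-type knowledge recovery against an "unlearned" model.

def recovery_rate(model, probes, forgotten_answer):
    """Fraction of probes whose response still reveals the forgotten answer."""
    hits = sum(forgotten_answer.lower() in model(p).lower() for p in probes)
    return hits / len(probes)

# Toy model that refuses direct questions about the forget target but still
# leaks through an alias -- the "mirage" a static, direct-query benchmark misses.
def toy_model(query):
    if "J.K. Rowling" in query:
        return "I don't know."
    return "That would be Harry Potter."

direct = ["What did J.K. Rowling write?"]
aliased = ["What did the author known as Robert Galbraith write?"]

print(recovery_rate(toy_model, direct, "Harry Potter"))   # 0.0
print(recovery_rate(toy_model, aliased, "Harry Potter"))  # 1.0
```

A static benchmark that only issues the direct query would report perfect forgetting (0.0 recovery), while the alias probe shows the knowledge is still fully accessible.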

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, inference_time
Datasets
TOFU, RWKU, WMDP, MUSE, WHP
Applications
LLM unlearning evaluation, knowledge deletion in language models