The Unlearning Mirage: A Dynamic Framework for Evaluating LLM Unlearning
Raj Sanjay Shah¹, Jing Huang², Keerthiram Murugesan³, Nathalie Baracaldo³, Diyi Yang²
Published on arXiv (2603.11266)
Sensitive Information Disclosure
OWASP LLM Top 10 — LLM06
Key Finding
Multi-hop and alias-based probes recover forgotten information missed by all existing static benchmarks, with activation analysis showing unlearning disrupts dominant pathways while leaving alternative multi-hop pathways intact.
Unlearning Mirage Framework
Novel technique introduced
Unlearning in Large Language Models (LLMs) aims to enhance safety, mitigate biases, and comply with legal mandates such as the right to be forgotten. However, existing unlearning methods are brittle: minor query modifications, such as multi-hop reasoning and entity aliasing, can recover supposedly forgotten information. As a result, current evaluation metrics often create an illusion of effectiveness, failing to detect these vulnerabilities because they rely on static, unstructured benchmarks. We propose a dynamic framework that stress-tests unlearning robustness using complex structured queries. Our approach first elicits knowledge from the target model (pre-unlearning) and then constructs targeted probes, ranging from simple queries to multi-hop chains, allowing precise control over query difficulty. Our experiments show that the framework (1) achieves coverage comparable to existing benchmarks by automatically generating semantically equivalent Q&A probes, (2) aligns with prior evaluations, and (3) uncovers new unlearning failures missed by other benchmarks, particularly in multi-hop settings. Furthermore, activation analyses show that single-hop queries typically follow dominant computation pathways, which are more likely to be disrupted by unlearning methods. In contrast, multi-hop queries tend to use alternative pathways that often remain intact, explaining the brittleness of unlearning techniques in multi-hop settings. Our framework enables practical, scalable evaluation of unlearning methods without manual construction of forget test sets, easing adoption in real-world applications. We release the pip package and the code at https://sites.google.com/view/unlearningmirage/home.
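The probe-construction idea described above can be sketched in a few lines: treat the knowledge elicited from the pre-unlearning model as (subject, relation, object) triples, then compose direct, alias-substituted, and chained multi-hop questions from those edges. The entities, relations, and question templates below are illustrative assumptions, not the paper's actual data or pipeline.

```python
# Toy knowledge elicited from the model before unlearning, as triples.
# In the actual framework these would come from the target model itself.
triples = [
    ("Alice Author", "wrote", "Book X"),
    ("Book X", "is set in", "City Y"),
]

# Known aliases of the forget target, used for alias-based probes.
aliases = {"Alice Author": ["A. Author"]}

def single_hop_probes(triples):
    # Direct probes: ask for the object of each edge.
    return [f"{s} {r} what?" for s, r, o in triples]

def alias_probes(triples, aliases):
    # Same questions, but with the subject replaced by each alias.
    return [f"{a} {r} what?"
            for s, r, o in triples
            for a in aliases.get(s, [])]

def two_hop_probes(triples):
    # Chain two edges through a shared intermediate entity, so the
    # probe never names the intermediate ("Book X") directly.
    return [f"The work that {s1} {r1} {r2} what?"
            for s1, r1, o1 in triples
            for s2, r2, o2 in triples
            if o1 == s2]

probes = (single_hop_probes(triples)
          + alias_probes(triples, aliases)
          + two_hop_probes(triples))
for p in probes:
    print(p)
```

Composing probes from graph edges is what gives the framework its control over difficulty: single-hop and alias probes differ only in surface form, while multi-hop probes force the model to route through the intermediate entity.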
Key Contributions
- Dynamic evaluation framework that constructs knowledge graphs from pre-unlearning model outputs and generates structured probes (single-hop, multi-hop, alias-based) to stress-test unlearning robustness without manual dataset construction
- Empirical demonstration that existing LLM unlearning methods are brittle: multi-hop and alias queries recover supposedly forgotten information that single-hop evaluations miss (~78% coverage of RWKU Q&A pairs)
- Activation analysis via PatchScopes revealing that unlearning disrupts dominant computation pathways used by direct queries, while multi-hop queries route through alternative intact pathways — explaining systematic evaluation failures