
Leak@$k$: Unlearning Does Not Make LLMs Forget Under Probabilistic Decoding

Hadi Reisizadeh 1, Jiajun Ruan 1, Yiwei Chen 2, Soumyadeep Pal 2, Sijia Liu 2,3, Mingyi Hong 1

1 citation · 48 references · arXiv


Published on arXiv · 2511.04934

Model Inversion Attack

OWASP ML Top 10 — ML03

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

All tested state-of-the-art LLM unlearning methods fail to prevent knowledge leakage under probabilistic decoding, while the proposed RULE algorithm achieves zero leakage on TOFU across a large number of generation samples.

leak@k / RULE

Novel technique introduced


Unlearning in large language models (LLMs) is critical for regulatory compliance and for building ethical generative AI systems that avoid producing private, toxic, illegal, or copyrighted content. Despite rapid progress, in this work we show that almost all existing unlearning methods fail to achieve true forgetting in practice. Specifically, while evaluations of these "unlearned" models under deterministic (greedy) decoding often suggest successful knowledge removal on standard benchmarks (as has been done in the literature), we show that sensitive information reliably resurfaces when models are sampled with standard probabilistic decoding. To rigorously capture this vulnerability, we introduce leak@k, a new meta-evaluation metric that quantifies the likelihood of forgotten knowledge reappearing when generating k samples from the model under realistic decoding strategies. Using three widely adopted benchmarks, TOFU, MUSE, and WMDP, we conduct the first large-scale, systematic study of unlearning reliability using our newly defined leak@k metric. Our findings demonstrate that knowledge leakage persists across methods and tasks, underscoring that current state-of-the-art unlearning techniques provide only limited forgetting and highlighting the urgent need for more robust approaches to LLM unlearning. We propose an algorithm, termed Robust Unlearning under LEak@k metric (RULE), which serves as an initial step toward addressing this concern. We demonstrate that RULE provides an unlearned model for the TOFU benchmark with no information leakage across a large number of generation samples.
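The metric described in the abstract can be estimated empirically: sample k generations per prompt under probabilistic decoding and count the fraction of prompts where the supposedly-forgotten answer reappears in at least one sample. The sketch below illustrates this idea; the `generate` interface and the mock sampler are hypothetical stand-ins, not the paper's implementation.

```python
import random

def leak_at_k(prompts, forgotten, generate, k=16):
    """Estimate leak@k: the fraction of prompts for which at least one
    of k stochastic generations reveals the forgotten answer.
    `generate` is any sampling-based decoder (hypothetical interface)."""
    leaked = 0
    for prompt, secret in zip(prompts, forgotten):
        samples = [generate(prompt) for _ in range(k)]
        if any(secret.lower() in s.lower() for s in samples):
            leaked += 1
    return leaked / len(prompts)

# Toy demo: a mock "unlearned" model that still leaks the secret
# on roughly 30% of sampled generations.
def mock_generate(prompt, p_leak=0.3):
    return "The answer is Paris." if random.random() < p_leak else "I don't know."

random.seed(0)
rate = leak_at_k(["Where was X born?"], ["paris"], mock_generate, k=20)
```

Under greedy decoding this mock model might always emit the refusal, yet with k=20 samples the leak is found almost surely, which is the gap the paper's evaluation highlights.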


Key Contributions

  • Introduces leak@k, a meta-evaluation metric quantifying the probability that supposedly-forgotten knowledge resurfaces across k probabilistic samples from an unlearned LLM
  • First large-scale systematic study showing virtually all state-of-the-art LLM unlearning methods fail under probabilistic decoding across TOFU, MUSE, and WMDP benchmarks
  • Proposes RULE (Robust Unlearning under LEak@k metric), an algorithm achieving zero leakage on TOFU even for large numbers of generation samples

🛡️ Threat Analysis

Model Inversion Attack

Shows that an adversary using probabilistic decoding (sampling k outputs) can extract supposedly-forgotten training data and knowledge from 'unlearned' LLMs — a concrete training data extraction vulnerability against models claiming to have removed that information.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, black_box
Datasets
TOFU, MUSE, WMDP
Applications
llm unlearning, privacy-preserving language models, safety-aligned language models