benchmark 2025

The Limits of Obliviate: Evaluating Unlearning in LLMs via Stimulus-Knowledge Entanglement-Behavior Framework

Aakriti Shah 1, Thai Le 2

0 citations · 32 references · arXiv

α

Published on arXiv

2510.25732

Model Inversion Attack

OWASP ML Top 10 — ML03

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

Authority framing recovers up to 128% more unlearned knowledge in 2.7B-parameter models, and knowledge entanglement metrics (M9) explain 78% of variance in unlearning robustness across OPT-2.7B, LLaMA-2-7B, LLaMA-3.1-8B, and LLaMA-2-13B

SKeB (Stimulus-Knowledge Entanglement-Behavior Framework)

Novel technique introduced


Unlearning in large language models (LLMs) is crucial for managing sensitive data and correcting misinformation, yet evaluating its effectiveness remains an open problem. We investigate whether persuasive prompting can recall factual knowledge from deliberately unlearned LLMs across models ranging from 2.7B to 13B parameters (OPT-2.7B, LLaMA-2-7B, LLaMA-3.1-8B, LLaMA-2-13B). Drawing from ACT-R and Hebbian theory (spreading activation theories), as well as communication principles, we introduce Stimulus-Knowledge Entanglement-Behavior Framework (SKeB), which models information entanglement via domain graphs and tests whether factual recall in unlearned models is correlated with persuasive framing. We develop entanglement metrics to quantify knowledge activation patterns and evaluate factuality, non-factuality, and hallucination in outputs. Our results show persuasive prompts substantially enhance factual knowledge recall (14.8% baseline vs. 24.5% with authority framing), with effectiveness inversely correlated to model size (128% recovery in 2.7B vs. 15% in 13B). SKeB provides a foundation for assessing unlearning completeness, robustness, and overall behavior in LLMs.


Key Contributions

  • SKeB framework grounding unlearning robustness evaluation in cognitive science (ACT-R, Hebbian spreading activation) and communication principles (rhetorical framing types)
  • Nine graph-based knowledge entanglement metrics, with distance-weighted influence (M9) achieving r=0.77 correlation with factual recall in unlearned models
  • Empirical evidence that authority framing recovers 14.8%→24.5% unlearned factual knowledge, with a strong inverse correlation between model size and vulnerability (128% recovery in 2.7B vs. 15% in 13B)

🛡️ Threat Analysis

Model Inversion Attack

Paper demonstrates extraction of training data (unlearned factual knowledge) from LLMs via persuasive prompting — an adversary actively recovering information the model was supposed to have forgotten, fitting the 'LLM memorization extraction' sub-case of ML03.


Details

Domains
nlp
Model Types
llm
Threat Tags
black_boxinference_time
Datasets
TOFU
Applications
machine unlearningllm knowledge extractionprivacy compliance evaluation