
Memory Self-Regeneration: Uncovering Hidden Knowledge in Unlearned Models

Agnieszka Polowczyk 1, Alicja Polowczyk 1, Joanna Waczyńska 2, Piotr Borycki 2, Przemysław Spurek 2,3



Published on arXiv · 2510.03263

Model Inversion Attack

OWASP ML Top 10 — ML03

Key Finding

MemoRa recovers supposedly unlearned concepts (nudity, artistic styles, objects) from unlearned diffusion models using only a few reference images, revealing that current unlearning methods achieve only temporary suppression rather than true knowledge removal.

MemoRa

Novel technique introduced


The impressive capability of modern text-to-image models to generate realistic visuals has come with a serious drawback: they can be misused to create harmful, deceptive, or unlawful content. This has accelerated the push for machine unlearning, a new field that seeks to selectively remove specific knowledge from a model without causing a drop in its overall performance. However, it turns out that actually forgetting a given concept is an extremely difficult task: models exposed to adversarial prompts can still generate so-called unlearned concepts, which may be not only harmful but also illegal. In this paper, we present considerations regarding the ability of models to forget and recall knowledge, introducing the Memory Self-Regeneration task. Furthermore, we present the MemoRa strategy, a regenerative approach supporting the effective recovery of previously lost knowledge. Moreover, we propose that robustness against knowledge retrieval is a crucial yet underexplored evaluation measure for developing more robust and effective unlearning techniques. Finally, we demonstrate that forgetting occurs in two distinct ways: short-term, where concepts can be quickly recalled, and long-term, where recovery is more challenging. Code is available at https://gmum.github.io/MemoRa/.


Key Contributions

  • Introduces the Memory Self-Regeneration task, formalizing the problem of recovering unlearned concepts from diffusion models
  • Proposes MemoRa, a practical attack using DDIM inversion, spherical interpolation, and LoRA adapter fine-tuning to restore erased concepts with only a few reference images
  • Identifies two distinct modes of machine forgetting — short-term (superficial suppression, quickly reversed) and long-term (deeper representational changes, harder to recover) — exposing a fundamental weakness in current unlearning evaluations
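One ingredient of the pipeline above is spherical interpolation between diffusion latents, which keeps interpolants near the hypersphere where Gaussian noise latents concentrate (plain linear interpolation shrinks their norm). A minimal sketch of slerp in NumPy, with illustrative names not taken from the paper's code:

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical interpolation between two latent tensors.

    Interpolates along the great-circle arc between z0 and z1,
    approximately preserving the norm of Gaussian diffusion latents.
    """
    z0 = np.asarray(z0, dtype=np.float64)
    z1 = np.asarray(z1, dtype=np.float64)
    # Angle between the two latents, treated as flat vectors.
    cos_theta = np.dot(z0.ravel(), z1.ravel()) / (
        np.linalg.norm(z0) * np.linalg.norm(z1)
    )
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    if np.isclose(theta, 0.0):
        # Nearly parallel latents: fall back to linear interpolation.
        return (1.0 - t) * z0 + t * z1
    s = np.sin(theta)
    return (np.sin((1.0 - t) * theta) / s) * z0 + (np.sin(t * theta) / s) * z1
```

At `t=0` this returns `z0` and at `t=1` it returns `z1`; for unit-norm endpoints the interpolant stays unit-norm, which is why slerp is the usual choice for blending inverted diffusion latents.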

🛡️ Threat Analysis

Model Inversion Attack

Following the machine unlearning decision tree: the paper explicitly ATTACKS an unlearning method by showing that 'unlearned' generative knowledge can still be extracted from the model. MemoRa is an adversarial strategy where an attacker uses a few reference images + DDIM inversion + LoRA fine-tuning to recover hidden residual knowledge (e.g., NSFW content, artistic styles) that the model was supposed to have permanently forgotten — the adversary extracts protected model knowledge despite unlearning defenses.
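The recovery step relies on LoRA: instead of updating the unlearned model's frozen weights, the attacker trains a low-rank additive correction, so only a few parameters need fitting from a few reference images. A toy sketch of the adapter math in NumPy (names, shapes, and initialization are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

class LoRALinear:
    """Minimal LoRA-style adapter: y = x @ (W + (alpha / r) * B @ A).

    W is the frozen base weight of the unlearned model; only the
    low-rank factors A and B would be trained during recovery.
    """

    def __init__(self, W, r=4, alpha=4, seed=0):
        rng = np.random.default_rng(seed)
        d_in, d_out = W.shape
        self.W = W                                 # frozen base weight
        self.A = rng.normal(0.0, 0.01, (r, d_out))  # trainable factor
        self.B = np.zeros((d_in, r))                # trainable; zero init
        self.scale = alpha / r

    def __call__(self, x):
        # Zero-initialized B makes the adapter a no-op before training.
        return x @ self.W + self.scale * (x @ self.B @ self.A)
```

Because `B` starts at zero, the adapted layer initially reproduces the unlearned model exactly; fine-tuning then only has to nudge the low-rank path back toward the residual "forgotten" representation.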


Details

Domains
vision, generative
Model Types
diffusion
Threat Tags
black_box, training_time, targeted
Datasets
ESD (Erased Stable Diffusion), FLUX.1 [dev]
Applications
text-to-image generation, machine unlearning for diffusion models