ReLAPSe: Reinforcement-Learning-trained Adversarial Prompt Search for Erased concepts in unlearned diffusion models
Ignacy Kolton 1, Kacper Marzol 1, Paweł Batorski 2, Marcin Mazur 1, Paul Swoboda 2, Przemysław Spurek 1,3
Published on arXiv
2602.00350
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
ReLAPSe recovers erased visual concepts (nudity, styles, objects, identities) from multiple state-of-the-art unlearned diffusion models near-instantly, outperforming per-instance optimization baselines in both efficiency and concept restoration quality.
ReLAPSe
Novel technique introduced
Machine unlearning is a key defense mechanism for removing unauthorized concepts from text-to-image diffusion models, yet recent evidence shows that latent visual information often persists after unlearning. Existing adversarial approaches for exploiting this leakage face fundamental limitations: optimization-based methods are computationally expensive due to per-instance iterative search, while reasoning-based and heuristic techniques lack direct feedback from the target model's latent visual representations. To address these challenges, we introduce ReLAPSe, a policy-based adversarial framework that reformulates concept restoration as a reinforcement learning problem. ReLAPSe trains an agent using Reinforcement Learning with Verifiable Rewards (RLVR), leveraging the diffusion model's noise prediction loss as a model-intrinsic, verifiable feedback signal. This closed-loop design directly aligns textual prompt manipulation with latent visual residuals, enabling the agent to learn transferable restoration strategies rather than optimizing isolated prompts. By pioneering the shift from per-instance optimization to global policy learning, ReLAPSe achieves efficient, near-real-time recovery of fine-grained identities and styles across multiple state-of-the-art unlearning methods, providing a scalable tool for rigorous red-teaming of unlearned diffusion models. Some experimental evaluations involve sensitive visual concepts, such as nudity. Code is available at https://github.com/gmum/ReLaPSe
Key Contributions
- ReLAPSe: an RL-based adversarial framework using RLVR with the diffusion model's noise prediction loss as a model-intrinsic reward signal for adversarial prompt search
- Shift from expensive per-instance optimization to global policy learning, enabling near-real-time recovery of erased concepts at deployment
- Demonstrates persistent latent concept residuals across multiple state-of-the-art unlearning methods (nudity, artistic styles, objects, fine-grained identities)
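The reward design above can be illustrated with a toy sketch. The idea, as the paper describes it, is that the diffusion model's own noise-prediction loss on latents of the erased concept serves as a verifiable scalar reward: if an adversarial prompt lowers the denoising error on those latents, it is steering the model back toward the concept. The `toy_denoiser`, the simplified forward process, and all names here are illustrative stand-ins, not the authors' implementation; a real attack would call the target model's U-Net.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(noisy_latent, prompt_embedding):
    # Hypothetical stand-in for the unlearned model's noise predictor
    # eps_theta(x_t, c); a real attack would query the target U-Net.
    return 0.9 * noisy_latent + 0.1 * prompt_embedding

def noise_prediction_reward(prompt_embedding, concept_latent):
    """Model-intrinsic RLVR reward: negative noise-prediction MSE.

    concept_latent is a latent of the erased concept; the policy's
    reward rises as the prompt makes the model denoise it better.
    """
    noise = rng.normal(size=concept_latent.shape)
    noisy_latent = concept_latent + noise  # simplified forward process
    predicted_noise = toy_denoiser(noisy_latent, prompt_embedding)
    mse = float(np.mean((predicted_noise - noise) ** 2))
    return -mse  # verifiable scalar fed to the RL policy update

# Example: score one candidate prompt embedding (toy data)
embedding = rng.normal(size=(4, 4))
latent = rng.normal(size=(4, 4))
reward = noise_prediction_reward(embedding, latent)
```

Because the MSE is non-negative, the reward is always at most zero; the RL agent maximizes it toward zero by finding prompts whose embeddings reduce the denoising error on the erased concept's latents.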
🛡️ Threat Analysis
ReLAPSe crafts adversarial text prompts (via RL policy optimization) that cause unlearned diffusion models to reproduce suppressed concepts at inference time — a direct input-manipulation / evasion attack against a safety defense. The attack surfaces latent concept residuals by manipulating the model's textual input, not by accessing or reconstructing original training-data records.