Attack · 2026

Roots Beneath the Cut: Uncovering the Risk of Concept Revival in Pruning-Based Unlearning for Diffusion Models

Ci Zhang 1, Zhaojun Ding 1, Chence Yang 1, Jun Liu 2,3, Xiaoming Zhai 4, Shaoyi Huang 1, Beiwen Li 5, Xiaolong Ma 1, Jin Lu 1, Geng Yuan 1


Published on arXiv: 2603.06640

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Pruning-based unlearning in diffusion models is insecure: erased concepts can be fully revived without any data or retraining by exploiting the side-channel information embedded in pruned weight locations.


Pruning-based unlearning has recently emerged as a fast, training-free, and data-independent approach to removing undesired concepts from diffusion models. It promises high efficiency and robustness, offering an attractive alternative to traditional fine-tuning or editing-based unlearning. However, in this paper we uncover a hidden danger behind this promising paradigm. We find that the locations of pruned weights, which are typically set to zero during unlearning, can act as side-channel signals that leak critical information about the erased concepts. To verify this vulnerability, we design a novel attack framework capable of reviving erased concepts from pruned diffusion models in a fully data-free and training-free manner. Our experiments confirm that pruning-based unlearning is not inherently secure: erased concepts can be effectively revived without any additional data or retraining. Extensive experiments on diffusion-model unlearning methods that operate on concept-related weights lead to a clear conclusion: once the critical concept-related weights in a diffusion model are identified, our method can effectively recover the original concept regardless of how those weights are manipulated. Finally, we explore potential defense strategies and advocate safer pruning mechanisms that conceal pruning locations while preserving unlearning effectiveness, providing practical insights for designing more secure pruning-based unlearning frameworks.
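The core leak is easy to see in miniature. The following NumPy sketch (hypothetical stand-in tensors, not the paper's actual attack code) shows why exact zeros in a pruned checkpoint reveal the pruned locations, and how an attacker holding the public pre-trained base checkpoint could copy the original values back into those locations with no data and no retraining:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a weight tensor from the public base
# checkpoint, and an "unlearned" copy in which concept-critical
# weights were pruned by setting them to zero.
base_weights = rng.normal(size=(4, 4)).astype(np.float32)
unlearned = base_weights.copy()
prune_mask = rng.random((4, 4)) < 0.2   # roughly 20% of weights pruned
unlearned[prune_mask] = 0.0

# Side channel: exact zeros are vanishingly rare in trained float
# weights, so the pruned locations are directly readable.
leaked_locations = (unlearned == 0.0)

# Revival: with the public base checkpoint in hand, copy the original
# values back into the leaked locations.
revived = unlearned.copy()
revived[leaked_locations] = base_weights[leaked_locations]
```

The mask recovered from the zeros matches the true pruning mask, and the revived tensor is identical to the base weights, which is the sense in which the attack is fully data-free and training-free.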


Key Contributions

  • Discovers that locations of pruned (zeroed) weights in diffusion models act as side-channel signals that leak information about erased concepts
  • Proposes a data-free, training-free concept revival attack framework exploiting these side-channel signals to restore erased concept generation capability
  • Explores defense strategies that conceal pruning locations to prevent concept revival while preserving unlearning effectiveness
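One way to conceal pruning locations, sketched below as an assumption rather than the paper's proposed defense, is to fill pruned entries with small noise matched to the scale of the surviving weights instead of leaving exact zeros, so a naive zero-scan no longer identifies them:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: an unlearned tensor whose pruned entries are
# exact zeros, leaking their locations.
weights = rng.normal(scale=0.02, size=(4, 4)).astype(np.float32)
prune_mask = rng.random((4, 4)) < 0.2
weights[prune_mask] = 0.0

# Location-concealing fill (illustrative): replace the zeros with
# noise much smaller than the surviving weights' spread, so pruned
# entries are no longer marked by exact zeros.
survivor_std = weights[~prune_mask].std()
noise = rng.normal(scale=0.1 * survivor_std, size=int(prune_mask.sum()))
concealed = weights.copy()
concealed[prune_mask] = noise.astype(np.float32)
```

After the fill, no entry is exactly zero, so the simple zero-location side channel no longer fires; whether the noise magnitude also preserves unlearning effectiveness is exactly the trade-off the paper flags for safer pruning mechanisms.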

🛡️ Threat Analysis

Output Integrity Attack

Pruning-based unlearning is a content safety control that suppresses certain model outputs. The attack defeats this protection by exploiting pruned weight locations as side-channel signals to restore concept generation capability, directly attacking the output integrity guarantee of the unlearning mechanism. Analogous to attacks that remove or defeat image protections, it removes a generative content restriction.


Details

Domains
generative, vision
Model Types
diffusion
Threat Tags
white_box, inference_time
Applications
image generation, content moderation in diffusion models