defense 2026

A Concept is More Than a Word: Diversified Unlearning in Text-to-Image Diffusion Models

Duc Hao Pham, Van Duy Truong, Duy Khanh Dinh, Tien Cuong Nguyen, Dien Hy Ngo, Tuan Anh Bui

0 citations

Published on arXiv (arXiv:2603.18767)

Model Inversion Attack

OWASP ML Top 10 — ML03

Key Finding

Achieves stronger concept erasure and improved robustness against adversarial recovery attacks while better retaining unrelated concepts compared to keyword-based unlearning baselines

Diversified Unlearning

Novel technique introduced


Concept unlearning has emerged as a promising direction for reducing the risk of harmful content generation in text-to-image diffusion models by selectively erasing undesirable concepts from a model's parameters. Existing approaches typically rely on keywords to identify the target concept to be unlearned. However, we show that this keyword-based formulation is inherently limited: a visual concept is multi-dimensional, can be expressed in diverse textual forms, and often overlaps with related concepts in the latent space, making keyword-only unlearning, which imprecisely indicates the target concept, brittle and prone to over-forgetting. This occurs because a single keyword represents only a narrow point estimate of the concept, failing to cover its full semantic distribution and entangled variations in the latent space. To address this limitation, we propose Diversified Unlearning, a distributional framework that represents a concept through a set of contextually diverse prompts rather than a single keyword. This richer representation enables more precise and robust unlearning. Through extensive experiments across multiple benchmarks and state-of-the-art baselines, we demonstrate that integrating Diversified Unlearning as an add-on component into existing unlearning pipelines consistently achieves stronger erasure, better retention of unrelated concepts, and improved robustness against adversarial recovery attacks.
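The core idea above can be sketched in a few lines. This is an illustrative toy, not the paper's code: it contrasts a keyword-based "point estimate" of a concept with the distributional anchor that Diversified Unlearning builds from contextually diverse prompts. The `embed` function is a deterministic stand-in for a real text encoder (e.g., a CLIP-style model), and the example prompts are invented.

```python
# Hedged sketch of the distributional concept representation idea.
# embed() is a toy stand-in for a real text encoder; the prompts are invented.
import hashlib
import math


def embed(text: str, dim: int = 8) -> list[float]:
    """Toy deterministic text embedding (stand-in for a CLIP-style encoder)."""
    digest = hashlib.sha256(text.encode()).digest()
    vec = [(b / 255.0) - 0.5 for b in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def mean_embedding(prompts: list[str]) -> list[float]:
    """Average the embeddings of a prompt set: a distributional concept anchor."""
    vecs = [embed(p) for p in prompts]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


# Keyword-based unlearning anchors erasure on a single point in text space:
keyword_anchor = embed("nudity")

# Diversified Unlearning instead represents the concept through contextually
# diverse prompts covering its varied textual expressions:
diverse_prompts = [
    "a nude figure in a classical painting",
    "an unclothed person on a beach",
    "explicit depiction of a naked body",
]
distributional_anchor = mean_embedding(diverse_prompts)

# A downstream erasure objective would steer the model away from the anchor;
# the distributional anchor covers more of the concept's semantic support
# than the single keyword's point estimate.
print(len(distributional_anchor))  # same dimensionality as the keyword anchor
```

In an actual unlearning pipeline the erasure loss would be computed against (or averaged over) the diverse prompt set rather than the lone keyword, which is what lets the method cover paraphrases that adversarial recovery attacks exploit.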


Key Contributions

  • Proposes distributional concept representation using contextually diverse prompts instead of single keywords for more precise unlearning
  • Demonstrates that keyword-based unlearning suffers from over-forgetting due to concept entanglement in latent space
  • Shows consistent improvements in erasure strength, concept retention, and robustness against adversarial recovery across multiple baselines

🛡️ Threat Analysis

Model Inversion Attack

The paper addresses concept unlearning in diffusion models and explicitly evaluates robustness against adversarial recovery attacks. The threat model involves an adversary attempting to recover/reconstruct erased training concepts (harmful content) that should have been removed. The paper's primary security contribution is defending against such recovery attempts through more robust unlearning.


Details

Domains
vision, generative, multimodal
Model Types
diffusion, transformer
Threat Tags
training_time, targeted
Datasets
I2P, UnlearnCanvas
Applications
text-to-image generation, harmful content prevention