Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

VLMs trained on web-scale data retain sensitive and copyrighted visual concepts that deployment may require removing. Training-based unlearning methods share a structural flaw: fine-tuning on a narrow forget set degrades general capabilities before unlearning begins, making it impossible to attribute subsequent performance drops to the unlearning procedure itself. Training-free approaches sidestep this by suppressing concepts through prompts or system instructions, but no rigorous benchmark exists for evaluating them on visual tasks. We introduce VLM-UnBench, the first benchmark for training-free visual concept unlearning in VLMs. It covers four forgetting levels, 7 source datasets, and 11 concept axes, and pairs a three-level probe taxonomy with five evaluation conditions to separate genuine forgetting from instruction compliance. Across 8 evaluation settings and 13 VLM configurations, realistic unlearning prompts leave forget accuracy near the no-instruction baseline; meaningful reductions appear only under oracle conditions that disclose the target concept to the model. Object and scene concepts are the most resistant to suppression, and stronger instruction-tuned models remain capable despite explicit forget instructions. These results expose a clear gap between prompt-level suppression and true visual concept erasure.

Key Contributions

First benchmark (VLM-UnBench) for evaluating training-free visual concept unlearning in VLMs, spanning 4 forgetting levels, 7 datasets, and 11 concept axes
Three-level probe taxonomy (P1-P3) with five evaluation conditions designed to distinguish genuine forgetting from instruction compliance
Demonstrates that realistic unlearning prompts fail to suppress visual concepts: forget accuracy remains near the no-instruction baseline except under oracle conditions that reveal the target concept
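The comparison behind the last contribution can be sketched as a per-condition accuracy over forget-set probes, followed by the drop relative to the no-instruction baseline. This is a minimal illustration, not the benchmark's code: the function names, the condition labels (`no_instruction`, `realistic_prompt`, `oracle`), and the toy records are assumptions made for the example, and the numbers are not results from the paper.

```python
from collections import defaultdict


def forget_accuracy(records):
    """Return {condition: accuracy} over forget-set probes.

    `records` is an iterable of (condition, correct) pairs, where `correct`
    is True when the model still answered the probe about the
    to-be-forgotten concept correctly.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for condition, correct in records:
        totals[condition] += 1
        hits[condition] += int(correct)
    return {c: hits[c] / totals[c] for c in totals}


def reduction_vs_baseline(acc, baseline="no_instruction"):
    """Return how far each condition's forget accuracy falls below the baseline."""
    base = acc[baseline]
    return {c: base - a for c, a in acc.items() if c != baseline}


# Toy records (hypothetical): four probes per condition.
records = (
    [("no_instruction", ok) for ok in (True, True, True, False)]
    + [("realistic_prompt", ok) for ok in (True, True, True, False)]
    + [("oracle", ok) for ok in (True, False, False, False)]
)

acc = forget_accuracy(records)
drops = reduction_vs_baseline(acc)
print(acc)
print(drops)
```

A drop near zero means the prompt left the model's recognition of the concept essentially unchanged; in the toy data only the oracle condition shows a sizeable reduction, which mirrors the pattern the summary describes.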

🛡️ Threat Analysis

Model Inversion Attack

The paper evaluates unlearning methods against the threat model of residual knowledge extraction: whether visual concepts that should be 'forgotten' can still be extracted from the model. The benchmark's core contribution is distinguishing genuine forgetting (the model can no longer recognize the concept) from instruction compliance (the model still recognizes the concept but follows instructions not to report it). This makes it a defense evaluation against model inversion and knowledge extraction, measuring whether suppressed concepts remain extractable under different probe strategies.
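One way to picture the distinction between genuine forgetting and instruction compliance is to compare accuracy on a probe that asks for the concept directly with accuracy on a probe that reaches it indirectly. The sketch below is an illustration under stated assumptions, not the paper's protocol: the direct/indirect split, the function `classify_suppression`, the tolerance `tol`, and the three outcome labels are all hypothetical.

```python
def classify_suppression(baseline_acc, direct_acc, indirect_acc, tol=0.1):
    """Label a suppression outcome from three accuracies (hypothetical rule).

    baseline_acc: forget accuracy with no unlearning instruction.
    direct_acc:   accuracy under the instruction when asked for the concept directly.
    indirect_acc: accuracy under the instruction on an indirect probe.
    tol:          largest drop still treated as "no change" (assumed value).
    """
    direct_dropped = (baseline_acc - direct_acc) > tol
    indirect_dropped = (baseline_acc - indirect_acc) > tol

    if not direct_dropped:
        # The model still reports the concept when asked directly.
        return "no_suppression"
    if not indirect_dropped:
        # The model withholds the direct answer, but the knowledge is still
        # reachable through another probe: compliance, not erasure.
        return "instruction_compliance"
    # Both probes fail; this is consistent with forgetting but does not prove it.
    return "consistent_with_forgetting"


print(classify_suppression(0.9, 0.88, 0.9))
print(classify_suppression(0.9, 0.2, 0.85))
print(classify_suppression(0.9, 0.2, 0.25))
```

The last label is deliberately cautious: a failure on every probe that was tried cannot rule out a stronger probe that would still extract the concept.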