Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning
Zhangyun Tan, Zeliang Zhang, Susan Liang, Yolo Yunlong Tang, Lisha Chen, Chenliang Xu
Published on arXiv
2604.03114
Model Inversion Attack
OWASP ML Top 10 — ML03
Sensitive Information Disclosure
OWASP LLM Top 10 — LLM06
Key Finding
Training-free unlearning prompts leave forget accuracy near the no-instruction baseline; meaningful suppression occurs only under oracle conditions that disclose the target concept to the model. Object and scene concepts are the most resistant to suppression.
VLM-UnBench
Novel technique introduced
VLMs trained on web-scale data retain sensitive and copyrighted visual concepts that deployment may require removing. Training-based unlearning methods share a structural flaw: fine-tuning on a narrow forget set degrades general capabilities before unlearning begins, making it impossible to attribute subsequent performance drops to the unlearning procedure itself. Training-free approaches sidestep this by suppressing concepts through prompts or system instructions, but no rigorous benchmark exists for evaluating them on visual tasks. We introduce VLM-UnBench, the first benchmark for training-free visual concept unlearning in VLMs. It covers four forgetting levels, 7 source datasets, and 11 concept axes, and pairs a three-level probe taxonomy with five evaluation conditions to separate genuine forgetting from instruction compliance. Across 8 evaluation settings and 13 VLM configurations, realistic unlearning prompts leave forget accuracy near the no-instruction baseline; meaningful reductions appear only under oracle conditions that disclose the target concept to the model. Object and scene concepts are the most resistant to suppression, and stronger instruction-tuned models remain capable despite explicit forget instructions. These results expose a clear gap between prompt-level suppression and true visual concept erasure.
Key Contributions
- First benchmark (VLM-UnBench) for evaluating training-free visual concept unlearning in VLMs, spanning 4 forgetting levels, 7 datasets, and 11 concept axes
- Three-level probe taxonomy (P1-P3) with five evaluation conditions designed to distinguish genuine forgetting from instruction compliance
- Demonstrates that realistic unlearning prompts fail to suppress visual concepts: forget accuracy remains near the no-instruction baseline except under oracle conditions that reveal the target concept
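The comparison behind the third contribution can be illustrated with a minimal sketch. It assumes forget accuracy is simply the fraction of forget-set probes the model still answers correctly, grouped by evaluation condition. The condition names (`no_instruction`, `realistic_prompt`, `oracle_prompt`) and the toy records are hypothetical placeholders, not the benchmark's actual labels or results.

```python
from collections import defaultdict


def forget_accuracy(records):
    """Accuracy on forget-set probes, grouped by evaluation condition.

    Each record is a (condition, is_correct) pair. A high value means the
    concept is still recognized, i.e. it was NOT forgotten.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for condition, is_correct in records:
        totals[condition] += 1
        hits[condition] += int(is_correct)
    return {c: hits[c] / totals[c] for c in totals}


def drop_from_baseline(acc, baseline="no_instruction"):
    """Accuracy drop of every other condition relative to the baseline."""
    base = acc[baseline]
    return {c: base - a for c, a in acc.items() if c != baseline}


# Toy, made-up records (hypothetical condition names, not benchmark data).
records = (
    [("no_instruction", ok) for ok in (True, True, True, False)]
    + [("realistic_prompt", ok) for ok in (True, True, True, False)]
    + [("oracle_prompt", ok) for ok in (True, False, False, False)]
)

acc = forget_accuracy(records)
drops = drop_from_baseline(acc)
print(acc)
print(drops)
```

In this toy data the realistic prompt produces no drop from the baseline while the oracle prompt does, which mirrors the qualitative pattern the paper reports. The numbers themselves are invented for illustration.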
🛡️ Threat Analysis
The paper evaluates unlearning methods against the threat of residual knowledge extraction: whether visual concepts that should be 'forgotten' can still be extracted from the model. The benchmark's core contribution is distinguishing genuine forgetting (the model can no longer recognize the concept) from instruction compliance (the model still recognizes the concept but follows an instruction not to report it). This makes it a defense evaluation against model inversion and knowledge extraction, measuring whether suppressed concepts remain extractable under different probe strategies.
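One way to picture the forgetting-versus-compliance distinction is the sketch below. It assumes two probe styles per concept: a direct probe that asks for the concept outright, and an indirect probe that reaches the same knowledge without naming it. Both probe styles, the function name `classify_suppression`, and the tolerance `tol` are hypothetical illustrations. The paper's actual P1-P3 probes and decision criteria are not reproduced here.

```python
def classify_suppression(baseline_acc, direct_acc, indirect_acc, tol=0.1):
    """Label the effect of a forget instruction on one concept.

    baseline_acc: accuracy with no forget instruction
    direct_acc:   accuracy on a direct probe, with the forget instruction
    indirect_acc: accuracy on an indirect probe, with the forget instruction
    tol:          hypothetical minimum drop that counts as suppression
    """
    direct_suppressed = (baseline_acc - direct_acc) > tol
    indirect_suppressed = (baseline_acc - indirect_acc) > tol

    if not direct_suppressed:
        # The model still answers the direct question: nothing was suppressed.
        return "no suppression"
    if indirect_suppressed:
        # Knowledge is unavailable through both routes.
        return "consistent with forgetting"
    # The model refuses directly but the concept is still extractable.
    return "instruction compliance"


print(classify_suppression(0.90, 0.88, 0.90))
print(classify_suppression(0.90, 0.20, 0.85))
print(classify_suppression(0.90, 0.20, 0.15))
```

The design point is that a drop on the direct probe alone is not evidence of erasure. Only when the concept also becomes unreachable through other probes does the outcome look like forgetting rather than compliance.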