benchmark 2026

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

Zhangyun Tan , Zeliang Zhang , Susan Liang , Yolo Yunlong Tang , Lisha Chen , Chenliang Xu

0 citations


Published on arXiv

2604.03114

Model Inversion Attack

OWASP ML Top 10 — ML03

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

Training-free unlearning prompts leave forget accuracy near the no-instruction baseline; meaningful suppression occurs only under oracle conditions that disclose the target concept to the model. Object and scene concepts are the most resistant to suppression.

VLM-UnBench

Novel technique introduced


VLMs trained on web-scale data retain sensitive and copyrighted visual concepts that deployment may require removing. Training-based unlearning methods share a structural flaw: fine-tuning on a narrow forget set degrades general capabilities before unlearning begins, making it impossible to attribute subsequent performance drops to the unlearning procedure itself. Training-free approaches sidestep this by suppressing concepts through prompts or system instructions, but no rigorous benchmark exists for evaluating them on visual tasks. We introduce VLM-UnBench, the first benchmark for training-free visual concept unlearning in VLMs. It covers four forgetting levels, 7 source datasets, and 11 concept axes, and pairs a three-level probe taxonomy with five evaluation conditions to separate genuine forgetting from instruction compliance. Across 8 evaluation settings and 13 VLM configurations, realistic unlearning prompts leave forget accuracy near the no-instruction baseline; meaningful reductions appear only under oracle conditions that disclose the target concept to the model. Object and scene concepts are the most resistant to suppression, and stronger instruction-tuned models remain capable despite explicit forget instructions. These results expose a clear gap between prompt-level suppression and true visual concept erasure.


Key Contributions

  • First benchmark (VLM-UnBench) for evaluating training-free visual concept unlearning in VLMs, spanning 4 forgetting levels, 7 datasets, and 11 concept axes
  • Three-level probe taxonomy (P1-P3) with five evaluation conditions designed to distinguish genuine forgetting from instruction compliance
  • Demonstrates that realistic unlearning prompts fail to suppress visual concepts — forget accuracy remains near baseline except under oracle conditions that reveal the target concept
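
The comparison behind the third contribution can be sketched as follows: compute forget accuracy per evaluation condition, then measure how far each condition falls below the no-instruction baseline. This is a minimal sketch, not the benchmark's own code. The condition labels (`no_instruction`, `realistic_prompt`, `oracle`), the record layout, and the function names are assumptions made for illustration, and the toy records are invented values, not results from the paper.

```python
from collections import defaultdict

def forget_accuracy(records):
    """Return {condition: accuracy} over forget-set probes.

    records: iterable of (condition, is_correct) pairs. This layout is
    hypothetical, chosen only to keep the example small.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for condition, is_correct in records:
        total[condition] += 1
        correct[condition] += int(is_correct)
    return {c: correct[c] / total[c] for c in total}

def drop_from_baseline(acc, baseline="no_instruction"):
    """Accuracy lost relative to the baseline condition (larger = more suppression)."""
    base = acc[baseline]
    return {c: base - a for c, a in acc.items() if c != baseline}

# Invented toy data: 4 forget-set probes per condition (not from the paper).
toy_records = (
    [("no_instruction", ok) for ok in (True, True, True, False)]
    + [("realistic_prompt", ok) for ok in (True, True, True, False)]
    + [("oracle", ok) for ok in (True, False, False, False)]
)

acc = forget_accuracy(toy_records)
drops = drop_from_baseline(acc)
print(acc)
print(drops)
```

In this toy setup, a drop near zero for the realistic prompt alongside a large drop for the oracle condition mirrors the pattern the summary describes: suppression appears only when the target concept is disclosed to the model.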

🛡️ Threat Analysis

Model Inversion Attack

The paper evaluates unlearning methods against the threat model of residual knowledge extraction: specifically, whether visual concepts that should be 'forgotten' can still be extracted from the model. The benchmark's core contribution is distinguishing genuine forgetting (where the model cannot recognize the concept) from instruction compliance (where the model can still recognize the concept but follows instructions not to report it). This is a defense evaluation against model inversion and knowledge extraction, measuring whether suppressed concepts remain extractable through different probe strategies.
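
One way to picture the distinction between genuine forgetting and instruction compliance is a simple decision rule over probe outcomes. The sketch below is hypothetical: the probe names (`direct`, `indirect_a`, `indirect_b`), the `Outcome` labels, and the rule itself are assumptions for illustration and are not the paper's P1-P3 definitions or its scoring procedure.

```python
from enum import Enum

class Outcome(Enum):
    NOT_SUPPRESSED = "not_suppressed"
    INSTRUCTION_COMPLIANCE = "instruction_compliance"
    CONSISTENT_WITH_FORGETTING = "consistent_with_forgetting"

def classify(probe_hits):
    """Classify one forget-set sample from its probe outcomes.

    probe_hits: dict mapping a probe name to True if that probe extracted
    the target concept. Must contain the key "direct" (hypothetical name
    for the most straightforward probe).
    """
    if probe_hits["direct"]:
        # The model still reports the concept when asked plainly.
        return Outcome.NOT_SUPPRESSED
    if any(hit for name, hit in probe_hits.items() if name != "direct"):
        # Hidden from the direct probe but recoverable another way:
        # the knowledge is still present, only its reporting is suppressed.
        return Outcome.INSTRUCTION_COMPLIANCE
    # No probe recovered the concept; this is consistent with forgetting,
    # but does not prove it, since an untested probe might still succeed.
    return Outcome.CONSISTENT_WITH_FORGETTING

print(classify({"direct": False, "indirect_a": True, "indirect_b": False}))
```

Under this rule, a concept counts as merely suppressed whenever any alternative probe still extracts it, which is why several probe strategies are needed before a failure to answer can be read as forgetting.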


Details

Domains
vision, multimodal
Model Types
vlm, multimodal, transformer
Threat Tags
inference_time
Datasets
MMMU, VLM-UnBench (7 source datasets covering objects, scenes, attributes, privacy)
Applications
privacy compliance, copyright protection, visual concept suppression, identity removal