
Published on arXiv

2510.22535

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

All evaluated MLLM unlearning methods fail under adversarial scrutiny: supposedly erased misinformation is easily recoverable, and every method is vulnerable to prompt attacks, exposing fundamental gaps in current approaches.

OFFSIDE

Novel technique introduced


Advances in Multimodal Large Language Models (MLLMs) intensify concerns about data privacy, making Machine Unlearning (MU), the selective removal of learned information, a critical necessity. However, existing MU benchmarks for MLLMs are limited by a lack of image diversity, potential inaccuracies, and insufficient evaluation scenarios, which fail to capture the complexity of real-world applications. To facilitate the development of MLLM unlearning and alleviate the aforementioned limitations, we introduce OFFSIDE, a novel benchmark for evaluating misinformation unlearning in MLLMs based on football transfer rumors. This manually curated dataset contains 15.68K records for 80 players, providing a comprehensive framework with four test sets to assess forgetting efficacy, generalization, utility, and robustness. OFFSIDE supports advanced settings like selective unlearning and corrective relearning, and crucially, unimodal unlearning (forgetting only text data). Our extensive evaluation of multiple baselines reveals key findings: (1) Unimodal methods (erasing text-based knowledge) fail on multimodal rumors; (2) Unlearning efficacy is largely driven by catastrophic forgetting; (3) All methods struggle with "visual rumors" (rumors that appear in the image); (4) Unlearned rumors can be easily recovered; and (5) All methods are vulnerable to prompt attacks. These results expose significant vulnerabilities in current approaches, highlighting the need for more robust multimodal unlearning solutions. The code is available at https://github.com/zh121800/OFFSIDE.
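Finding (2), that apparent unlearning efficacy is largely driven by catastrophic forgetting, can be illustrated with a toy sketch. This is not the paper's method, models, or data: it applies a common baseline technique (gradient ascent on the forget set) to a small logistic classifier with synthetic clusters, all of which are illustrative assumptions. Erasing the forget set this way also destroys accuracy on the retain set:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a model holding "retain" knowledge (label 0) and
# "forget" knowledge (label 1): a 2-D logistic classifier.
X_retain = rng.normal(loc=-2.0, size=(100, 2))
X_forget = rng.normal(loc=+2.0, size=(100, 2))
y_retain, y_forget = np.zeros(100), np.ones(100)
X = np.vstack([X_retain, X_forget])
y = np.concatenate([y_retain, y_forget])

def sigmoid(z):
    # Clipped for numerical stability once the weights grow large.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500.0, 500.0)))

def grad(w, X, y):
    # Gradient of mean binary cross-entropy for logistic regression.
    return X.T @ (sigmoid(X @ w) - y) / len(y)

def acc(w, X, y):
    return float(((sigmoid(X @ w) > 0.5) == y).mean())

# 1) Standard training on retain + forget data together.
w = np.zeros(2)
for _ in range(300):
    w -= 0.1 * grad(w, X, y)

acc_forget_before = acc(w, X_forget, y_forget)
acc_retain_before = acc(w, X_retain, y_retain)

# 2) "Unlearning" via gradient ascent on the forget set only,
#    continued until forget-set accuracy collapses.
for _ in range(2000):
    w += 1.0 * grad(w, X_forget, y_forget)
    if acc(w, X_forget, y_forget) <= 0.1:
        break

acc_forget_after = acc(w, X_forget, y_forget)
acc_retain_after = acc(w, X_retain, y_retain)

print(f"forget acc: {acc_forget_before:.2f} -> {acc_forget_after:.2f}")
print(f"retain acc: {acc_retain_before:.2f} -> {acc_retain_after:.2f}")
# Retain accuracy collapses along with forget accuracy: the apparent
# unlearning is largely catastrophic forgetting.
```

The forget-set accuracy drops as intended, but the retain-set accuracy drops with it, mirroring the benchmark's observation that forgetting scores alone can overstate how selectively a method removes knowledge.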


Key Contributions

  • OFFSIDE: a manually curated 15.68K-record multimodal benchmark for evaluating misinformation unlearning in MLLMs across four test sets (forgetting efficacy, generalization, utility, robustness)
  • Reveals that unimodal (text-only) unlearning methods fail on multimodal rumors, and that all tested methods are vulnerable to adversarial recovery of, and prompt attacks on, supposedly erased knowledge
  • Introduces advanced evaluation settings including selective unlearning, corrective relearning, and unimodal unlearning specific to the multimodal context

🛡️ Threat Analysis


Details

Domains
multimodal, nlp, vision
Model Types
vlm, llm, multimodal
Threat Tags
inference_time, black_box
Datasets
OFFSIDE (15.68K records, 80 football players)
Applications
multimodal large language models, misinformation removal, machine unlearning evaluation