
Published on arXiv

2510.22535

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

All evaluated MLLM unlearning methods fail under adversarial scrutiny: supposedly erased misinformation is easily recoverable, and every method is vulnerable to prompt attacks, exposing fundamental gaps in current approaches.

OFFSIDE

Novel technique introduced


Advances in Multimodal Large Language Models (MLLMs) intensify concerns about data privacy, making Machine Unlearning (MU), the selective removal of learned information, a critical necessity. However, existing MU benchmarks for MLLMs are limited by a lack of image diversity, potential inaccuracies, and insufficient evaluation scenarios, which fail to capture the complexity of real-world applications. To facilitate the development of MLLM unlearning and alleviate the aforementioned limitations, we introduce OFFSIDE, a novel benchmark for evaluating misinformation unlearning in MLLMs based on football transfer rumors. This manually curated dataset contains 15.68K records for 80 players, providing a comprehensive framework with four test sets to assess forgetting efficacy, generalization, utility, and robustness. OFFSIDE supports advanced settings like selective unlearning and corrective relearning, and crucially, unimodal unlearning (forgetting only text data). Our extensive evaluation of multiple baselines reveals key findings: (1) Unimodal methods (erasing text-based knowledge) fail on multimodal rumors; (2) Unlearning efficacy is largely driven by catastrophic forgetting; (3) All methods struggle with "visual rumors" (rumors that appear in the image); (4) Unlearned rumors can be easily recovered; and (5) All methods are vulnerable to prompt attacks. These results expose significant vulnerabilities in current approaches, highlighting the need for more robust multimodal unlearning solutions. The code is available at https://github.com/zh121800/OFFSIDE.
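Finding (2), that apparent unlearning efficacy is largely driven by catastrophic forgetting, can be illustrated with a toy sketch. This is not the paper's method, models, or data: it applies a common baseline technique (gradient ascent on the forget set) to a small logistic classifier with synthetic clusters, all of which are illustrative assumptions. Erasing the forget set this way also destroys accuracy on the retain set:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a model holding "retain" knowledge (label 0) and
# "forget" knowledge (label 1): a 2-D logistic classifier.
X_retain = rng.normal(loc=-2.0, size=(100, 2))
X_forget = rng.normal(loc=+2.0, size=(100, 2))
y_retain, y_forget = np.zeros(100), np.ones(100)
X = np.vstack([X_retain, X_forget])
y = np.concatenate([y_retain, y_forget])

def sigmoid(z):
    # Clipped for numerical stability once the weights grow large.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500.0, 500.0)))

def grad(w, X, y):
    # Gradient of mean binary cross-entropy for logistic regression.
    return X.T @ (sigmoid(X @ w) - y) / len(y)

def acc(w, X, y):
    return float(((sigmoid(X @ w) > 0.5) == y).mean())

# 1) Standard training on retain + forget data together.
w = np.zeros(2)
for _ in range(300):
    w -= 0.1 * grad(w, X, y)

acc_forget_before = acc(w, X_forget, y_forget)
acc_retain_before = acc(w, X_retain, y_retain)

# 2) "Unlearning" via gradient ascent on the forget set only,
#    continued until forget-set accuracy collapses.
for _ in range(2000):
    w += 1.0 * grad(w, X_forget, y_forget)
    if acc(w, X_forget, y_forget) <= 0.1:
        break

acc_forget_after = acc(w, X_forget, y_forget)
acc_retain_after = acc(w, X_retain, y_retain)

print(f"forget acc: {acc_forget_before:.2f} -> {acc_forget_after:.2f}")
print(f"retain acc: {acc_retain_before:.2f} -> {acc_retain_after:.2f}")
# Retain accuracy collapses along with forget accuracy: the apparent
# unlearning is largely catastrophic forgetting.
```

The forget-set accuracy drops as intended, but the retain-set accuracy drops with it, mirroring the benchmark's observation that forgetting scores alone can overstate how selectively a method removes knowledge.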


Key Contributions

  • OFFSIDE: a manually curated 15.68K-record multimodal benchmark for evaluating misinformation unlearning in MLLMs across four test sets (forgetting efficacy, generalization, utility, robustness)
  • Reveals that unimodal (text-only) unlearning methods fail on multimodal rumors, and that all tested methods are vulnerable to adversarial recovery of, and prompt attacks on, supposedly erased knowledge
  • Introduces advanced evaluation settings including selective unlearning, corrective relearning, and unimodal unlearning specific to the multimodal context

🛡️ Threat Analysis


Details

Domains
multimodal, nlp, vision
Model Types
vlm, llm, multimodal
Threat Tags
inference_time, black_box
Datasets
OFFSIDE (15.68K records, 80 football players)
Applications
multimodal large language models, misinformation removal, machine unlearning evaluation