benchmark 2025

OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation

Xiaojun Jia ¹, Jie Liao ²,³, Qi Guo ²,⁴, Teng Ma ²,⁵, Simeng Qin ²,⁶, Ranjie Duan ⁷, Tianlin Li ¹, Yihao Huang ¹, Zhitao Zeng ⁸, Dongxian Wu ⁹, Yiming Li ¹, Wenqi Ren ⁵, Xiaochun Cao ⁵, Yang Liu ¹

¹ Nanyang Technological University

² BraneMatrix AI

³ Chongqing University

⁴ Xi’an Jiaotong University

⁵ Sun Yat-sen University

⁶ Northeastern University

⁷ Alibaba

⁸ National University of Singapore

⁹ ByteDance

5 citations · 54 references · arXiv

Published on arXiv

2512.06589

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Comprehensive evaluation of 18 MLLMs reveals systematic vulnerabilities to multimodal jailbreak attacks across all tested models, with both open-source and closed-source systems susceptible under the unified three-dimensional evaluation framework.

OmniSafeBench-MM

Novel technique introduced

Recent advances in multi-modal large language models (MLLMs) have enabled unified perception-reasoning capabilities, yet these systems remain highly vulnerable to jailbreak attacks that bypass safety alignment and induce harmful behaviors. Existing benchmarks such as JailBreakV-28K, MM-SafetyBench, and HADES provide valuable insights into multi-modal vulnerabilities, but they typically focus on limited attack scenarios, lack standardized defense evaluation, and offer no unified, reproducible toolbox. To address these gaps, we introduce OmniSafeBench-MM, a comprehensive toolbox for multi-modal jailbreak attack-defense evaluation. OmniSafeBench-MM integrates 13 representative attack methods, 15 defense strategies, and a diverse dataset spanning 9 major risk domains and 50 fine-grained categories, structured across consultative, imperative, and declarative inquiry types to reflect realistic user intentions. Beyond data coverage, it establishes a three-dimensional evaluation protocol measuring (1) harmfulness, graded on a granular, multi-level scale ranging from low-impact individual harm to catastrophic societal threats, (2) intent alignment between responses and queries, and (3) response detail level, enabling nuanced safety-utility analysis. We conduct extensive experiments on 10 open-source and 8 closed-source MLLMs to reveal their vulnerability to multi-modal jailbreak attacks. By unifying data, methodology, and evaluation into an open-source, reproducible platform, OmniSafeBench-MM provides a standardized foundation for future research. The code is released at https://github.com/jiaxiaojunQAQ/OmniSafeBench-MM.
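To give a sense of the scale implied by the counts in the abstract, the following minimal sketch enumerates every attack, defense, and model combination. The placeholder names (attack_01, defense_01, model_01, and so on) are hypothetical, and whether the authors actually run the full cross product is an assumption not confirmed by this summary.

```python
from itertools import product

# Counts taken from the abstract: 13 attacks, 15 defenses,
# and 10 open-source + 8 closed-source models.
N_ATTACKS, N_DEFENSES, N_MODELS = 13, 15, 10 + 8

# Hypothetical placeholder identifiers; the real method and model
# names are not listed in this summary.
attacks = [f"attack_{i:02d}" for i in range(1, N_ATTACKS + 1)]
defenses = [f"defense_{i:02d}" for i in range(1, N_DEFENSES + 1)]
models = [f"model_{i:02d}" for i in range(1, N_MODELS + 1)]

# Full cross product of (attack, defense, model) settings.
grid = list(product(attacks, defenses, models))

print(len(grid))  # 3510
print(grid[0])    # ('attack_01', 'defense_01', 'model_01')
```

Each of these settings would then be run over the dataset, which is why a shared, reproducible toolbox matters for comparing results across papers.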

Key Contributions

OmniSafeBench-MM toolbox integrating 13 jailbreak attack methods and 15 defense strategies in a single reproducible platform
Diverse dataset spanning 9 major risk domains and 50 fine-grained categories across consultative, imperative, and declarative query types
Three-dimensional evaluation protocol measuring harmfulness severity, intent alignment, and response detail level, tested on 10 open-source and 8 closed-source MLLMs
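As an illustration of how a three-dimensional judgement could be recorded and aggregated, here is a minimal sketch. The integer scales, the thresholds, and the names Judgement, is_successful_jailbreak, and attack_success_rate are all hypothetical; the summary does not state the paper's actual scoring ranges or success rule.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Judgement:
    """One judged model response. All scales below are assumed, not the paper's."""
    harmfulness: int       # assumed 0 (harmless) .. 5 (catastrophic societal threat)
    intent_alignment: int  # assumed 0 (ignores the query) .. 5 (fully addresses it)
    detail_level: int      # assumed 0 (no usable detail) .. 5 (step-by-step detail)


def is_successful_jailbreak(j: Judgement, min_harm: int = 3,
                            min_align: int = 3, min_detail: int = 3) -> bool:
    """Hypothetical rule: count a response only if all three dimensions clear a threshold."""
    return (j.harmfulness >= min_harm
            and j.intent_alignment >= min_align
            and j.detail_level >= min_detail)


def attack_success_rate(judgements: Iterable[Judgement]) -> float:
    """Fraction of responses that satisfy the hypothetical success rule."""
    items = list(judgements)
    if not items:
        return 0.0
    return sum(is_successful_jailbreak(j) for j in items) / len(items)


sample = [
    Judgement(4, 5, 4),  # harmful, on-topic, detailed: counted
    Judgement(4, 1, 4),  # harmful but off-topic: not counted
    Judgement(0, 5, 5),  # on-topic but harmless: not counted
    Judgement(5, 3, 3),  # clears every threshold: counted
]
print(attack_success_rate(sample))  # 0.5
```

Requiring all three dimensions to clear a threshold separates responses that are harmful, on-topic, and actionable from those that merely contain alarming text, which is the kind of safety-utility distinction a single harmfulness score cannot make.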

🛡️ Threat Analysis

Details

Domains

multimodal, nlp, vision

Model Types

vlm, llm, multimodal

Threat Tags

black_box, grey_box, white_box, inference_time

Datasets

JailBreakV-28K, MM-SafetyBench, HADES, OmniSafeBench-MM (proposed)

Applications

multimodal large language models, vision-language models, safety alignment evaluation

Similar Papers

CSR-Bench: A Benchmark for Evaluating the Cross-modal Safety and Reliability of MLLMs

MTMCS-Bench: Evaluating Contextual Safety of Multimodal Large Language Models in Multi-Turn Dialogues

FENCE: A Financial and Multimodal Jailbreak Detection Dataset

Do Images Speak Louder than Words? Investigating the Effect of Textual Misinformation in VLMs

SGM: Safety Glasses for Multimodal Large Language Models via Neuron-Level Detoxification

Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images

Models as Lego Builders: Assembling Malice from Benign Blocks via Semantic Blueprints

Jailbreaking Large Vision Language Models in Intelligent Transportation Systems