Benchmark · 2025

Omni-SafetyBench: A Benchmark for Safety Evaluation of Audio-Visual Large Language Models

Leyi Pan 1,2, Zheyu Fu 1, Yunpeng Zhai 2, Shuchang Tao 2, Sheng Guan 1, Shiyu Huang 3, Lingzhe Zhang 2,4, Zhaoyang Liu 2, Bolin Ding 2, Felix Henry 3, Aiwei Liu 1, Lijie Wen 1


Published on arXiv: 2508.07173

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Only 3 of the 10 evaluated OLLMs achieve above 0.6 on both Safety-score and CMSC-score; some models score as low as 0.14 on specific modalities, and safety defenses consistently weaken under complex joint audio-visual inputs.

Novel Technique Introduced

Omni-SafetyBench


Abstract

The rise of Omni-modal Large Language Models (OLLMs), which integrate visual and auditory processing with text, necessitates robust safety evaluations to mitigate harmful outputs. However, no dedicated benchmarks currently exist for OLLMs, and existing benchmarks fail to assess safety under joint audio-visual inputs or cross-modal consistency. To fill this gap, we introduce Omni-SafetyBench, the first comprehensive parallel benchmark for OLLM safety evaluation, featuring 24 modality variations with 972 samples each, including audio-visual harm cases. Considering OLLMs' comprehension challenges with complex omni-modal inputs and the need for cross-modal consistency evaluation, we propose tailored metrics: a Safety-score based on Conditional Attack Success Rate (C-ASR) and Conditional Refusal Rate (C-RR) to account for comprehension failures, and a Cross-Modal Safety Consistency score (CMSC-score) to measure consistency across modalities. Evaluating 6 open-source and 4 closed-source OLLMs reveals critical vulnerabilities: (1) only 3 models achieve over 0.6 in both average Safety-score and CMSC-score; (2) safety defenses weaken with complex inputs, especially joint audio-visual inputs; (3) severe weaknesses persist, with some models scoring as low as 0.14 on specific modalities. Using Omni-SafetyBench, we evaluated existing safety alignment algorithms and identified key challenges in OLLM safety alignment: (1) inference-time methods are inherently less effective because they cannot alter the model's underlying understanding of safety; (2) post-training methods struggle with out-of-distribution issues due to the vast number of modality combinations in OLLMs; and (3) safety tasks involving audio-visual inputs are more complex, making even in-distribution training data less effective. Our proposed benchmark, metrics, and findings highlight the urgent need for enhanced OLLM safety.
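The abstract names the conditional metrics but does not give their formulas. The Python sketch below illustrates one plausible reading of a comprehension-conditioned metric: rates are computed only over samples the model demonstrably understood, so comprehension failures are not mistaken for safety. The `EvalRecord` fields, the `comprehended` flag, and the equal-weight averaging in `safety_score` are assumptions for illustration, not the paper's published definitions.

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    comprehended: bool    # model demonstrably understood the omni-modal input (assumed flag)
    attack_success: bool  # model produced the harmful content
    refused: bool         # model explicitly refused the request

def conditional_rates(records):
    """C-ASR and C-RR computed only over comprehended samples (illustrative definition)."""
    understood = [r for r in records if r.comprehended]
    if not understood:
        return 0.0, 0.0
    c_asr = sum(r.attack_success for r in understood) / len(understood)
    c_rr = sum(r.refused for r in understood) / len(understood)
    return c_asr, c_rr

def safety_score(records):
    """Assumed combination: average of (1 - C-ASR) and C-RR, yielding a value in [0, 1]."""
    c_asr, c_rr = conditional_rates(records)
    return 0.5 * ((1.0 - c_asr) + c_rr)
```

Under this reading, a model that simply fails to parse a joint audio-visual input is excluded from both rates, so a low Safety-score reflects genuinely unsafe behavior rather than confusion.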


Key Contributions

  • Omni-SafetyBench: first comprehensive parallel safety benchmark for OLLMs with 24 modality variations and 972 samples each, including audio-visual joint harm cases
  • Novel tailored metrics: Conditional Attack Success Rate (C-ASR), Conditional Refusal Rate (C-RR), and a Cross-Modal Safety Consistency score (CMSC-score) to account for comprehension failures and measure cross-modal consistency (see the sketch after this list)
  • Evaluation of 10 OLLMs showing that safety defenses degrade significantly under complex joint audio-visual inputs, plus an analysis of why both inference-time and post-training alignment methods struggle on OLLMs
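The CMSC-score is described only as measuring how consistent a model's safety behavior is across modality variations. Below is a minimal sketch of one way such a score could be derived from per-modality Safety-scores, using mean absolute pairwise gaps; the `cmsc_score` function and its pairwise-gap formulation are assumptions for illustration, not the paper's formula.

```python
from itertools import combinations

def cmsc_score(per_modality_scores):
    """Illustrative cross-modal consistency: 1 minus the mean absolute
    pairwise gap between per-modality Safety-scores.

    per_modality_scores: dict mapping modality name -> Safety-score in [0, 1].
    """
    scores = list(per_modality_scores.values())
    pairs = list(combinations(scores, 2))
    if not pairs:
        return 1.0  # a single modality is trivially self-consistent
    mean_gap = sum(abs(a - b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_gap

# Example: a model that is safe on text but unsafe on joint audio-visual
# input scores poorly on consistency even if its average safety looks moderate.
print(cmsc_score({"text": 0.9, "image": 0.8, "audio": 0.7, "audio_visual": 0.2}))  # ~0.63
```

A formulation like this captures the paper's headline pattern: large per-modality gaps (e.g., 0.9 on text versus 0.14 on a joint audio-visual variant) drag the consistency score down regardless of the average.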

Details

Domains: multimodal, nlp, audio, vision
Model Types: llm, multimodal
Threat Tags: inference_time, black_box
Datasets: Omni-SafetyBench
Applications: omni-modal LLMs, audio-visual language models, multimodal AI safety evaluation