Benchmark · 2025

Omni-SafetyBench: A Benchmark for Safety Evaluation of Audio-Visual Large Language Models

Leyi Pan 1,2, Zheyu Fu 1, Yunpeng Zhai 2, Shuchang Tao 2, Sheng Guan 1, Shiyu Huang 3, Lingzhe Zhang 2,4, Zhaoyang Liu 2, Bolin Ding 2, Felix Henry 3, Aiwei Liu 1, Lijie Wen 1


Published on arXiv: 2508.07173

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Only 3 of the 10 evaluated OLLMs achieve above 0.6 on both Safety-score and CMSC-score; some models score as low as 0.14 on specific modalities, and safety defenses consistently weaken under complex joint audio-visual inputs.

Novel Technique Introduced

Omni-SafetyBench


Abstract

The rise of Omni-modal Large Language Models (OLLMs), which integrate visual and auditory processing with text, necessitates robust safety evaluations to mitigate harmful outputs. However, no dedicated benchmarks currently exist for OLLMs, and existing benchmarks fail to assess safety under joint audio-visual inputs or cross-modal consistency. To fill this gap, we introduce Omni-SafetyBench, the first comprehensive parallel benchmark for OLLM safety evaluation, featuring 24 modality variations with 972 samples each, including audio-visual harm cases. Considering OLLMs' comprehension challenges with complex omni-modal inputs and the need for cross-modal consistency evaluation, we propose tailored metrics: a Safety-score based on Conditional Attack Success Rate (C-ASR) and Conditional Refusal Rate (C-RR) to account for comprehension failures, and a Cross-Modal Safety Consistency score (CMSC-score) to measure consistency across modalities. Evaluating 6 open-source and 4 closed-source OLLMs reveals critical vulnerabilities: (1) only 3 models achieve over 0.6 in both average Safety-score and CMSC-score; (2) safety defenses weaken with complex inputs, especially joint audio-visual inputs; (3) severe weaknesses persist, with some models scoring as low as 0.14 on specific modalities. Using Omni-SafetyBench, we evaluated existing safety alignment algorithms and identified key challenges in OLLM safety alignment: (1) inference-time methods are inherently less effective because they cannot alter the model's underlying understanding of safety; (2) post-training methods struggle with out-of-distribution issues due to the vast number of modality combinations in OLLMs; and (3) safety tasks involving audio-visual inputs are more complex, making even in-distribution training data less effective. Our proposed benchmark, metrics, and findings highlight the urgent need for enhanced OLLM safety.
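The abstract names the conditional metrics but does not give their formulas. The Python sketch below illustrates one plausible reading of a comprehension-conditioned metric: rates are computed only over samples the model demonstrably understood, so comprehension failures are not mistaken for safety. The `EvalRecord` fields, the `comprehended` flag, and the equal-weight averaging in `safety_score` are assumptions for illustration, not the paper's published definitions.

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    comprehended: bool    # model demonstrably understood the omni-modal input (assumed flag)
    attack_success: bool  # model produced the harmful content
    refused: bool         # model explicitly refused the request

def conditional_rates(records):
    """C-ASR and C-RR computed only over comprehended samples (illustrative definition)."""
    understood = [r for r in records if r.comprehended]
    if not understood:
        return 0.0, 0.0
    c_asr = sum(r.attack_success for r in understood) / len(understood)
    c_rr = sum(r.refused for r in understood) / len(understood)
    return c_asr, c_rr

def safety_score(records):
    """Assumed combination: average of (1 - C-ASR) and C-RR, yielding a value in [0, 1]."""
    c_asr, c_rr = conditional_rates(records)
    return 0.5 * ((1.0 - c_asr) + c_rr)
```

Under this reading, a model that simply fails to parse a joint audio-visual input is excluded from both rates, so a low Safety-score reflects genuinely unsafe behavior rather than confusion.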


Key Contributions

  • Omni-SafetyBench: first comprehensive parallel safety benchmark for OLLMs with 24 modality variations and 972 samples each, including audio-visual joint harm cases
  • Novel tailored metrics: Conditional Attack Success Rate (C-ASR), Conditional Refusal Rate (C-RR), and a Cross-Modal Safety Consistency score (CMSC-score) to account for comprehension failures and measure cross-modal consistency (see the sketch after this list)
  • Evaluation of 10 OLLMs showing that safety defenses degrade significantly under complex joint audio-visual inputs, plus an analysis of why both inference-time and post-training alignment methods struggle on OLLMs
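The CMSC-score is described only as measuring how consistent a model's safety behavior is across modality variations. Below is a minimal sketch of one way such a score could be derived from per-modality Safety-scores, using mean absolute pairwise gaps; the `cmsc_score` function and its pairwise-gap formulation are assumptions for illustration, not the paper's formula.

```python
from itertools import combinations

def cmsc_score(per_modality_scores):
    """Illustrative cross-modal consistency: 1 minus the mean absolute
    pairwise gap between per-modality Safety-scores.

    per_modality_scores: dict mapping modality name -> Safety-score in [0, 1].
    """
    scores = list(per_modality_scores.values())
    pairs = list(combinations(scores, 2))
    if not pairs:
        return 1.0  # a single modality is trivially self-consistent
    mean_gap = sum(abs(a - b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_gap

# Example: a model that is safe on text but unsafe on joint audio-visual
# input scores poorly on consistency even if its average safety looks moderate.
print(cmsc_score({"text": 0.9, "image": 0.8, "audio": 0.7, "audio_visual": 0.2}))  # ~0.63
```

A formulation like this captures the paper's headline pattern: large per-modality gaps (e.g., 0.9 on text versus 0.14 on a joint audio-visual variant) drag the consistency score down regardless of the average.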

Details

Domains: multimodal, nlp, audio, vision
Model Types: llm, multimodal
Threat Tags: inference_time, black_box
Datasets: Omni-SafetyBench
Applications: omni-modal LLMs, audio-visual language models, multimodal AI safety evaluation