CSR-Bench: A Benchmark for Evaluating the Cross-modal Safety and Reliability of MLLMs
Yuxuan Liu, Yuntian Shi, Kun Wang, Haoting Shen, Kun Yang
Published on arXiv
2602.03263
Prompt Injection
OWASP LLM Top 10 (LLM01)
Key Finding
All 16 evaluated MLLMs show systematic cross-modal alignment gaps, with weak safety awareness and evidence that apparent safety gains stem from refusal heuristics rather than robust intent understanding.
CSR-Bench
Novel technique introduced
Multimodal large language models (MLLMs) enable interaction over both text and images, but their safety behavior can be driven by unimodal shortcuts instead of true joint intent understanding. We introduce CSR-Bench, a benchmark for evaluating cross-modal reliability through four stress-testing interaction patterns spanning Safety, Over-rejection, Bias, and Hallucination, covering 61 fine-grained types. Each instance is constructed to require integrated image-text interpretation, and we additionally provide paired text-only controls to diagnose modality-induced behavior shifts. We evaluate 16 state-of-the-art MLLMs and observe systematic cross-modal alignment gaps. Models show weak safety awareness, strong language dominance under interference, and consistent performance degradation from text-only controls to multimodal inputs. We also observe a clear trade-off between reducing over-rejection and maintaining safe, non-discriminatory behavior, suggesting that some apparent safety gains may come from refusal-oriented heuristics rather than robust intent understanding. WARNING: This paper contains unsafe content.
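The paired-control diagnostic described above can be sketched as a simple comparison over matched prompts. This is a minimal illustration, assuming a hypothetical per-instance schema (the field names and the shift metric below are our own illustrative choices, not the paper's actual data format or scoring):

```python
# Hypothetical sketch of the paired text-only control diagnostic: each
# benchmark instance pairs a multimodal (image+text) prompt with a matched
# text-only control, and we measure how often correct behavior on the
# control degrades once the image is required. The schema and metric are
# illustrative assumptions, not CSR-Bench's actual implementation.
from dataclasses import dataclass


@dataclass
class PairedInstance:
    dimension: str        # one of: Safety, Over-rejection, Bias, Hallucination
    multimodal_ok: bool   # model behaved correctly on the image+text prompt
    text_only_ok: bool    # model behaved correctly on the text-only control


def modality_shift_rate(instances):
    """Fraction of instances where the model is correct on the text-only
    control but fails on the multimodal version (a cross-modal gap)."""
    if not instances:
        return 0.0
    shifted = sum(1 for x in instances if x.text_only_ok and not x.multimodal_ok)
    return shifted / len(instances)


# Toy run on three hypothetical instances.
batch = [
    PairedInstance("Safety", multimodal_ok=False, text_only_ok=True),
    PairedInstance("Bias", multimodal_ok=True, text_only_ok=True),
    PairedInstance("Hallucination", multimodal_ok=False, text_only_ok=True),
]
print(modality_shift_rate(batch))  # 2 of 3 pairs show a modality-induced shift
```

A rate well above zero on such pairs is the kind of evidence the abstract cites for "consistent performance degradation from text-only controls to multimodal inputs".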
Key Contributions
- CSR-Bench: a cross-modal safety/reliability benchmark with 61 fine-grained types across Safety, Over-rejection, Bias, and Hallucination dimensions requiring integrated image-text interpretation
- Paired text-only controls for diagnosing modality-induced behavior shifts in MLLMs
- Systematic evaluation of 16 state-of-the-art MLLMs revealing cross-modal alignment gaps, language dominance under interference, and a safety/over-rejection trade-off