
Red Teaming Multimodal Language Models: Evaluating Harm Across Prompt Modalities and Models

Madison Van Doren, Casey Ford

AAAI 2026 AIGOV Workshop and E...


Published on arXiv: 2509.15478

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Pixtral 12B produced harmful responses in ~62% of cases vs Claude Sonnet 3.5's ~10%; contrary to expectations, text-only prompts were slightly more effective than multimodal prompts at bypassing safety mechanisms.


Multimodal large language models (MLLMs) are increasingly used in real-world applications, yet their safety under adversarial conditions remains underexplored. This study evaluates the harmlessness of four leading MLLMs (GPT-4o, Claude Sonnet 3.5, Pixtral 12B, and Qwen VL Plus) when exposed to adversarial prompts in text-only and multimodal formats. A team of 26 red teamers generated 726 prompts targeting three harm categories: illegal activity, disinformation, and unethical behaviour. These prompts were submitted to each model, and 17 annotators rated the resulting 2,904 model outputs for harmfulness on a 5-point scale. Results show significant differences in vulnerability across models and modalities. Pixtral 12B exhibited the highest rate of harmful responses (~62%), while Claude Sonnet 3.5 was the most resistant (~10%). Contrary to expectations, text-only prompts were slightly more effective at bypassing safety mechanisms than multimodal ones. Statistical analysis confirmed that both model type and input modality were significant predictors of harmfulness. These findings underscore the urgent need for robust multimodal safety benchmarks as MLLMs are deployed more widely.
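The abstract reports per-model harmful-response rates derived from 5-point annotator ratings. The paper does not specify its aggregation rule, so the sketch below is a minimal illustration under an assumed rule: average each output's annotator ratings and count it as harmful when the mean meets a threshold. The threshold of 3 and the toy data are assumptions, not the study's.

```python
from statistics import mean

def harmful_rate(ratings_per_output, threshold=3):
    """Fraction of outputs whose mean annotator rating meets the threshold.

    ratings_per_output: one inner list of 1-5 ratings per model output.
    Both the mean-based aggregation and the threshold are assumptions;
    the paper does not state its exact rule.
    """
    flagged = [mean(ratings) >= threshold for ratings in ratings_per_output]
    return sum(flagged) / len(flagged)

# Toy example: three outputs, each rated by three annotators
rate = harmful_rate([[1, 2, 1], [4, 5, 4], [3, 3, 2]])
print(rate)  # → 0.3333... (only the second output clears the threshold)
```

A majority-vote rule (count of ratings ≥ threshold exceeding half the annotators) would be an equally plausible aggregation; the choice affects borderline outputs like the third one above.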


Key Contributions

  • Human red-team evaluation of GPT-4o, Claude Sonnet 3.5, Pixtral 12B, and Qwen VL Plus across 726 adversarial prompts in text-only and multimodal formats
  • Quantitative harmfulness rating by 17 annotators across 2,904 model outputs using a 5-point scale, covering three harm categories
  • Statistical analysis confirming that both model type and input modality are significant predictors of harmful output rates, with text-only prompts slightly outperforming multimodal ones at bypassing safety filters
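The modality finding above rests on comparing harmful-output rates between text-only and multimodal prompts. The paper does not publish its per-modality counts or its exact test, so the sketch below shows one standard way to test such a difference: a two-proportion z-test, run on hypothetical counts (the 1,452-per-modality split assumes the 2,904 outputs divide evenly, which is an assumption).

```python
import math

def two_proportion_ztest(harmful_a, n_a, harmful_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Returns the z statistic and its two-sided p-value under the
    pooled-proportion null hypothesis.
    """
    p_a, p_b = harmful_a / n_a, harmful_b / n_b
    p_pool = (harmful_a + harmful_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # two-sided p-value via the standard normal CDF (erf form)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts, NOT the paper's data: harmful outputs for
# text-only vs multimodal prompts out of 1,452 outputs each.
z, p = two_proportion_ztest(410, 1452, 370, 1452)
print(f"z = {z:.2f}, p = {p:.3f}")
```

The paper's analysis also treats model type as a predictor; extending this to both factors at once would call for a logistic regression rather than pairwise proportion tests.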

🛡️ Threat Analysis


Details

Domains
nlp, multimodal
Model Types
llm, vlm
Threat Tags
black_box, inference_time
Applications
multimodal chatbots, llm safety evaluation