
Red Teaming Multimodal Language Models: Evaluating Harm Across Prompt Modalities and Models

Madison Van Doren, Casey Ford

AAAI 2026 AIGOV Workshop and E...


Published on arXiv: 2509.15478

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Pixtral 12B produced harmful responses in ~62% of cases vs Claude Sonnet 3.5's ~10%; contrary to expectations, text-only prompts were slightly more effective than multimodal prompts at bypassing safety mechanisms.


Multimodal large language models (MLLMs) are increasingly used in real-world applications, yet their safety under adversarial conditions remains underexplored. This study evaluates the harmlessness of four leading MLLMs (GPT-4o, Claude Sonnet 3.5, Pixtral 12B, and Qwen VL Plus) when exposed to adversarial prompts in text-only and multimodal formats. A team of 26 red teamers generated 726 prompts targeting three harm categories: illegal activity, disinformation, and unethical behaviour. These prompts were submitted to each model, and 17 annotators rated the resulting 2,904 model outputs for harmfulness on a 5-point scale. Results show significant differences in vulnerability across models and modalities. Pixtral 12B exhibited the highest rate of harmful responses (~62%), while Claude Sonnet 3.5 was the most resistant (~10%). Contrary to expectations, text-only prompts were slightly more effective at bypassing safety mechanisms than multimodal ones. Statistical analysis confirmed that both model type and input modality were significant predictors of harmfulness. These findings underscore the urgent need for robust multimodal safety benchmarks as MLLMs are deployed more widely.
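The abstract reports per-model harmful-response rates derived from 5-point annotator ratings. The paper does not specify its aggregation rule, so the sketch below is a minimal illustration under an assumed rule: average each output's annotator ratings and count it as harmful when the mean meets a threshold. The threshold of 3 and the toy data are assumptions, not the study's.

```python
from statistics import mean

def harmful_rate(ratings_per_output, threshold=3):
    """Fraction of outputs whose mean annotator rating meets the threshold.

    ratings_per_output: one inner list of 1-5 ratings per model output.
    Both the mean-based aggregation and the threshold are assumptions;
    the paper does not state its exact rule.
    """
    flagged = [mean(ratings) >= threshold for ratings in ratings_per_output]
    return sum(flagged) / len(flagged)

# Toy example: three outputs, each rated by three annotators
rate = harmful_rate([[1, 2, 1], [4, 5, 4], [3, 3, 2]])
print(rate)  # → 0.3333... (only the second output clears the threshold)
```

A majority-vote rule (count of ratings ≥ threshold exceeding half the annotators) would be an equally plausible aggregation; the choice affects borderline outputs like the third one above.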


Key Contributions

  • Human red-team evaluation of GPT-4o, Claude Sonnet 3.5, Pixtral 12B, and Qwen VL Plus across 726 adversarial prompts in text-only and multimodal formats
  • Quantitative harmfulness rating by 17 annotators across 2,904 model outputs using a 5-point scale, covering three harm categories
  • Statistical analysis confirming that both model type and input modality are significant predictors of harmful output rates, with text-only prompts slightly outperforming multimodal ones at bypassing safety filters
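The modality finding above rests on comparing harmful-output rates between text-only and multimodal prompts. The paper does not publish its per-modality counts or its exact test, so the sketch below shows one standard way to test such a difference: a two-proportion z-test, run on hypothetical counts (the 1,452-per-modality split assumes the 2,904 outputs divide evenly, which is an assumption).

```python
import math

def two_proportion_ztest(harmful_a, n_a, harmful_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Returns the z statistic and its two-sided p-value under the
    pooled-proportion null hypothesis.
    """
    p_a, p_b = harmful_a / n_a, harmful_b / n_b
    p_pool = (harmful_a + harmful_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # two-sided p-value via the standard normal CDF (erf form)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts, NOT the paper's data: harmful outputs for
# text-only vs multimodal prompts out of 1,452 outputs each.
z, p = two_proportion_ztest(410, 1452, 370, 1452)
print(f"z = {z:.2f}, p = {p:.3f}")
```

The paper's analysis also treats model type as a predictor; extending this to both factors at once would call for a logistic regression rather than pairwise proportion tests.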

🛡️ Threat Analysis


Details

Domains
nlp, multimodal
Model Types
llm, vlm
Threat Tags
black_box, inference_time
Applications
multimodal chatbots, llm safety evaluation