Alignment Drift in Multimodal LLMs: A Two-Phase, Longitudinal Evaluation of Harm Across Eight Model Releases
Casey Ford, Madison Van Doren, Emily Dix
Published on arXiv
2602.04739
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Attack success rates showed clear alignment drift: GPT and Claude models became MORE vulnerable across generations, while Pixtral and Qwen modestly improved, demonstrating that MLLM safety is neither uniform nor stable across model updates.
Multimodal large language models (MLLMs) are increasingly deployed in real-world systems, yet their safety under adversarial prompting remains underexplored. We present a two-phase evaluation of MLLM harmlessness using a fixed benchmark of 726 adversarial prompts authored by 26 professional red teamers. Phase 1 assessed GPT-4o, Claude Sonnet 3.5, Pixtral 12B, and Qwen VL Plus; Phase 2 evaluated their successors (GPT-5, Claude Sonnet 4.5, Pixtral Large, and Qwen Omni), yielding 82,256 human harm ratings in total. Large, persistent differences emerged across model families: Pixtral models were consistently the most vulnerable, whereas Claude models appeared safest owing to high refusal rates. Attack success rates (ASR) showed clear alignment drift: GPT and Claude models exhibited increased ASR across generations, while Pixtral and Qwen showed modest decreases. Modality effects also shifted over time: text-only prompts were more effective in Phase 1, whereas Phase 2 produced model-specific patterns, with GPT-5 and Claude 4.5 showing near-equivalent vulnerability across modalities. These findings demonstrate that MLLM harmlessness is neither uniform nor stable across updates, underscoring the need for longitudinal, multimodal benchmarks to track evolving safety behaviour.
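The abstract's central metric is attack success rate (ASR): the fraction of adversarial prompts whose responses human raters judged harmful. As a rough sketch of how such a metric could be tabulated from per-prompt harm ratings (the field names, rating scale, and harm threshold below are hypothetical assumptions for illustration, not the paper's actual annotation schema):

```python
from collections import defaultdict

def attack_success_rate(ratings, harm_threshold=3):
    """Tabulate ASR per (model, modality) pair.

    ratings: iterable of dicts with hypothetical keys 'model',
    'modality', and 'harm_score'. A prompt counts as a successful
    attack when its harm score meets the threshold (an assumed
    convention, not the paper's stated criterion).
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for r in ratings:
        key = (r["model"], r["modality"])
        totals[key] += 1
        if r["harm_score"] >= harm_threshold:
            successes[key] += 1
    # ASR = successful attacks / total attempted prompts per condition
    return {k: successes[k] / totals[k] for k in totals}

# Toy example with made-up data: two models, text-only prompts.
sample = [
    {"model": "model-a", "modality": "text", "harm_score": 4},
    {"model": "model-a", "modality": "text", "harm_score": 1},
    {"model": "model-b", "modality": "text", "harm_score": 2},
]
asr = attack_success_rate(sample)
# model-a: 1 of 2 prompts over threshold; model-b: 0 of 1
```

Comparing ASR dictionaries computed separately for Phase 1 and Phase 2 models is what would surface the generational drift the paper reports.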
Key Contributions
- Fixed longitudinal benchmark of 726 adversarial prompts from 26 professional red teamers, enabling cross-generation safety comparisons across eight MLLM releases
- Discovery of 'alignment drift': GPT and Claude models showed increased attack success rates across generations, while Pixtral and Qwen showed modest decreases
- 82,256 human harm ratings revealing persistent inter-family differences in harmlessness and shifting modality vulnerability patterns over time