Latest papers

2 papers
benchmark · arXiv · Feb 4, 2026

Alignment Drift in Multimodal LLMs: A Two-Phase, Longitudinal Evaluation of Harm Across Eight Model Releases

Casey Ford, Madison Van Doren, Emily Dix · Appen

Longitudinal red-team benchmark reveals unstable alignment across MLLM generations, with GPT and Claude showing increased attack success rates over time

Prompt Injection · nlp · multimodal
benchmark · AAAI 2026 AIGOV Workshop and E... · Sep 18, 2025

Red Teaming Multimodal Language Models: Evaluating Harm Across Prompt Modalities and Models

Madison Van Doren, Casey Ford · Appen

Human red-team benchmark of 4 MLLMs across 726 adversarial prompts finds Pixtral 12B most vulnerable at ~62% harm rate vs Claude's ~10%

Prompt Injection · nlp · multimodal