ML Security Papers

Latest papers

2 papers

benchmark · arXiv · Feb 4, 2026

Alignment Drift in Multimodal LLMs: A Two-Phase, Longitudinal Evaluation of Harm Across Eight Model Releases

Casey Ford, Madison Van Doren, Emily Dix · Appen

Longitudinal red-team benchmark reveals unstable alignment across MLLM generations, with GPT and Claude showing increased attack success rates over time

Prompt Injection · nlp · multimodal

benchmark · AAAI 2026 AIGOV Workshop and E... · Sep 18, 2025

Red Teaming Multimodal Language Models: Evaluating Harm Across Prompt Modalities and Models

Madison Van Doren, Casey Ford · Appen

Human red-team benchmark of 4 MLLMs across 726 adversarial prompts finds Pixtral 12B most vulnerable at ~62% harm rate vs Claude's ~10%

Prompt Injection · nlp · multimodal
