LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models
Guolei Huang 1,2, Qinzhi Peng 3,2, Gan Xu 4,2, Yao Huang 5,2, Yuxuan Lu 2, Yongjun Shen 1
Published on arXiv (2509.25896)
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
LLaVAShield consistently outperforms strong baselines on multimodal multi-turn content moderation tasks under dynamic policy configurations, establishing new state-of-the-art results
LLaVAShield
Novel technique introduced
As Vision-Language Models (VLMs) move into interactive, multi-turn use, new safety risks arise that single-turn or single-modality moderation misses. In Multimodal Multi-Turn (MMT) dialogues, malicious intent can be spread across turns and images, while context-sensitive replies may still advance harmful content. To address this challenge, we present the first systematic definition and study of MMT dialogue safety. Building on this formulation, we introduce the Multimodal Multi-turn Dialogue Safety (MMDS) dataset. We further develop an automated multimodal multi-turn red-teaming framework based on Monte Carlo Tree Search (MCTS) to generate unsafe multimodal multi-turn dialogues for MMDS. MMDS contains 4,484 annotated multimodal dialogue samples with fine-grained safety ratings, policy dimension labels, and evidence-based rationales for both users and assistants. Leveraging MMDS, we present LLaVAShield, a powerful tool that jointly detects and assesses risk in user inputs and assistant responses. Across comprehensive experiments, LLaVAShield consistently outperforms strong baselines on MMT content moderation tasks and under dynamic policy configurations, establishing new state-of-the-art results. We will publicly release the dataset and model to support future research.
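The abstract describes generating unsafe multimodal multi-turn dialogues with Monte Carlo Tree Search. The paper does not spell out the search details here, so the following is a minimal, hypothetical sketch of what such an MCTS red-teaming loop could look like: the node class, the UCT selection rule, and the `propose_turns` / `unsafe_score` placeholders are all assumptions standing in for a VLM turn generator and a safety judge.

```python
import math
import random

class Node:
    """One node per partial dialogue in the red-teaming search tree."""
    def __init__(self, dialogue, parent=None):
        self.dialogue = dialogue          # list of (role, turn) pairs so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0                  # accumulated unsafe-ness reward

def ucb(node, c=1.4):
    # Standard UCT score: exploit high-reward branches, explore rare ones.
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def propose_turns(dialogue, k=3):
    # Placeholder: a real framework would ask a VLM for k candidate
    # text+image turns conditioned on the dialogue so far.
    return [f"candidate-turn-{len(dialogue)}-{i}" for i in range(k)]

def unsafe_score(dialogue):
    # Placeholder: a real framework would score with a safety judge;
    # here longer dialogues simply earn more reward so the search runs.
    return min(1.0, len(dialogue) / 5)

def mcts_red_team(n_iters=50, max_depth=5):
    root = Node(dialogue=[])
    for _ in range(n_iters):
        # Selection: descend by UCT until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # Expansion: add candidate next turns unless at max depth.
        if len(node.dialogue) < max_depth:
            for turn in propose_turns(node.dialogue):
                node.children.append(
                    Node(node.dialogue + [("user", turn)], parent=node))
            node = random.choice(node.children)
        # Evaluation: score the dialogue prefix for unsafe content.
        reward = unsafe_score(node.dialogue)
        # Backpropagation: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Return the most-visited first move's dialogue prefix.
    best = max(root.children, key=lambda n: n.visits)
    return best.dialogue

dialogue = mcts_red_team()
```

The design intuition is that tree search can distribute malicious intent across several innocuous-looking turns, which is exactly the failure mode single-turn moderation misses.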
Key Contributions
- First systematic definition and study of Multimodal Multi-Turn (MMT) dialogue safety, identifying risks from malicious intent spread across turns and modalities
- MMDS dataset: 4,484 annotated multimodal dialogue samples with fine-grained safety ratings, policy labels, and evidence-based rationales, generated via an MCTS-based automated red-teaming framework
- LLaVAShield: a content moderation tool that jointly detects and assesses risk in both user inputs and assistant responses across multi-turn VLM dialogues, achieving state-of-the-art results
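The contributions above imply a per-turn annotation record (safety rating, policy dimension labels, evidence-based rationale) produced for both user and assistant turns. As a rough illustration of that joint-moderation interface, here is a hypothetical sketch; the field names, rating scale, and the `toy_judge` stand-in are assumptions, not the released MMDS schema or the LLaVAShield model.

```python
from dataclasses import dataclass, field

@dataclass
class TurnModeration:
    """One moderation record per dialogue turn, user or assistant."""
    role: str                        # "user" or "assistant"
    safety_rating: int               # e.g. 0 (safe) .. 3 (severe); scale assumed
    policy_dims: list = field(default_factory=list)  # violated policy dimensions
    rationale: str = ""              # evidence-based explanation for the rating

def moderate_dialogue(turns, judge):
    # A joint moderator scores every turn in context, so harm that is
    # spread across turns (and modalities) is assessed cumulatively.
    return [judge(turns[: i + 1], turn) for i, turn in enumerate(turns)]

def toy_judge(context, turn):
    # Keyword stand-in for a real judge model such as LLaVAShield.
    risky = "exploit" in turn["text"].lower()
    return TurnModeration(
        role=turn["role"],
        safety_rating=3 if risky else 0,
        policy_dims=["illicit-behavior"] if risky else [],
        rationale="mentions an exploit" if risky else "no policy violation")

records = moderate_dialogue(
    [{"role": "user", "text": "How do I exploit this bug?"},
     {"role": "assistant", "text": "I can't help with that."}],
    toy_judge)
# records carry a rating for the user turn and the assistant turn separately
```

Scoring user inputs and assistant responses with one model, rather than moderating only outputs, matches the paper's framing of jointly detecting risk on both sides of the dialogue.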