VLMShield: Efficient and Robust Defense of Vision-Language Models against Malicious Prompts
Peigui Qi 1, Kunsheng Tang 1, Yanpu Yu 1, Jialin Wu 2, Yide Song 3, Wenbo Zhou 1, Zhicong Huang 2, Cheng Hong 2, Weiming Zhang 1, Nenghai Yu 1
Published on arXiv
2604.06502
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Demonstrates superior performance in robustness, efficiency, and utility for detecting multimodal jailbreak attacks on VLMs
VLMShield
Novel technique introduced
Vision-Language Models (VLMs) face significant safety vulnerabilities from malicious prompt attacks due to weakened alignment during visual integration. Existing defenses suffer from limitations in efficiency and robustness. To address these challenges, we first propose the Multimodal Aggregated Feature Extraction (MAFE) framework, which enables CLIP to handle long text and fuse multimodal information into unified representations. Through empirical analysis of MAFE-extracted features, we discover distinct distributional patterns between benign and malicious prompts. Building upon this finding, we develop VLMShield, a lightweight safety detector that efficiently identifies multimodal malicious attacks as a plug-and-play solution. Extensive experiments demonstrate superior performance across multiple dimensions, including robustness, efficiency, and utility. Through our work, we hope to pave the way for more secure multimodal AI deployment. Code is available at [github.com/pgqihere/VLMShield](https://github.com/pgqihere/VLMShield).
Key Contributions
- Proposes MAFE (Multimodal Aggregated Feature Extraction) framework enabling CLIP to handle long text and fuse multimodal information
- Discovers distinct distributional patterns between benign and malicious prompts in MAFE-extracted features
- Develops VLMShield as a lightweight plug-and-play safety detector for identifying multimodal malicious attacks on VLMs
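The summary does not detail MAFE's internals, so the following is a minimal sketch under stated assumptions: long text is split into windows that fit CLIP's 77-token limit, the per-window text embeddings are mean-pooled, and the pooled text vector is concatenated with the image embedding to form the unified representation that a lightweight detector head scores. The function names, the pooling choice, and the single logistic unit are illustrative, not the paper's actual design.

```python
import numpy as np

EMB_DIM = 512  # typical CLIP embedding width (assumption)

def mafe_aggregate(chunk_embs: np.ndarray, image_emb: np.ndarray) -> np.ndarray:
    """Fuse per-chunk text embeddings and an image embedding into one vector.

    chunk_embs: (n_chunks, EMB_DIM) text embeddings, one per 77-token window
    image_emb:  (EMB_DIM,) image embedding
    """
    text_emb = chunk_embs.mean(axis=0)                      # pool long-text windows
    text_emb = text_emb / (np.linalg.norm(text_emb) + 1e-8) # unit-normalise each modality
    image_emb = image_emb / (np.linalg.norm(image_emb) + 1e-8)
    return np.concatenate([text_emb, image_emb])            # unified representation

def score_malicious(feature: np.ndarray, w: np.ndarray, b: float) -> float:
    """Lightweight detector head: a single logistic unit over the fused feature."""
    return 1.0 / (1.0 + np.exp(-(feature @ w + b)))

# Toy usage with random stand-ins for real CLIP embeddings.
rng = np.random.default_rng(42)
chunks = rng.standard_normal((3, EMB_DIM))  # a long prompt split into 3 windows
img = rng.standard_normal(EMB_DIM)
feat = mafe_aggregate(chunks, img)
w = rng.standard_normal(2 * EMB_DIM) * 0.01
prob = score_malicious(feat, w, 0.0)
print(feat.shape)  # (1024,)
```

In practice the detector would be trained on features extracted from labelled benign/malicious prompts; the key property the paper relies on is that those two classes separate in this fused feature space, so even a very small head suffices.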
🛡️ Threat Analysis
Defends against adversarial visual inputs to VLMs: the paper addresses malicious prompts that combine visual and textual components designed to manipulate VLM behavior. The multimodal attack vector involves adversarial manipulation of inputs at inference time.