defense 2026

VLMShield: Efficient and Robust Defense of Vision-Language Models against Malicious Prompts

Peigui Qi 1, Kunsheng Tang 1, Yanpu Yu 1, Jialin Wu 2, Yide Song 3, Wenbo Zhou 1, Zhicong Huang 2, Cheng Hong 2, Weiming Zhang 1, Nenghai Yu 1


Published on arXiv: 2604.06502

Input Manipulation Attack (OWASP ML Top 10 — ML01)

Prompt Injection (OWASP LLM Top 10 — LLM01)

Key Finding: Demonstrates superior performance in robustness, efficiency, and utility for detecting multimodal jailbreak attacks on VLMs

Novel technique introduced: VLMShield


Vision-Language Models (VLMs) face significant safety vulnerabilities from malicious prompt attacks due to weakened alignment during visual integration. Existing defenses fall short in both efficiency and robustness. To address these challenges, we first propose the Multimodal Aggregated Feature Extraction (MAFE) framework, which enables CLIP to handle long text and fuse multimodal information into unified representations. Through empirical analysis of MAFE-extracted features, we discover distinct distributional patterns between benign and malicious prompts. Building on this finding, we develop VLMShield, a lightweight safety detector that efficiently identifies multimodal malicious attacks as a plug-and-play solution. Extensive experiments demonstrate superior performance across multiple dimensions, including robustness, efficiency, and utility. Through our work, we hope to pave the way for more secure multimodal AI deployment. Code is available at [https://github.com/pgqihere/VLMShield](https://github.com/pgqihere/VLMShield).
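The abstract's core idea — letting CLIP handle prompts longer than its text-encoder context and fusing both modalities into one feature vector — can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the function names, the mean-pooling aggregation, and the stand-in encoders (which return deterministic random vectors in place of real CLIP embeddings) are all assumptions.

```python
import numpy as np

MAX_TOKENS = 77  # CLIP's text-encoder context limit
EMBED_DIM = 512  # typical CLIP ViT-B/32 embedding size

def embed_text_window(tokens):
    """Stand-in for CLIP's text encoder (assumption: real code would
    call the actual model); returns a deterministic embedding."""
    rng = np.random.default_rng(abs(hash(tuple(tokens))) % (2**32))
    return rng.standard_normal(EMBED_DIM)

def embed_image(image):
    """Stand-in for CLIP's image encoder."""
    rng = np.random.default_rng(0)
    return rng.standard_normal(EMBED_DIM)

def aggregate_long_text(tokens, window=MAX_TOKENS):
    """Split a long prompt into <=77-token windows, embed each window,
    and mean-pool the window embeddings into one text vector."""
    windows = [tokens[i:i + window] for i in range(0, len(tokens), window)]
    embs = np.stack([embed_text_window(w) for w in windows])
    return embs.mean(axis=0)

def fuse(text_emb, image_emb):
    """Concatenate both modalities into one unified feature vector."""
    return np.concatenate([text_emb, image_emb])

tokens = list(range(200))  # a prompt longer than one CLIP window
feature = fuse(aggregate_long_text(tokens), embed_image(None))
print(feature.shape)  # → (1024,)
```

The unified vector is what a downstream detector would consume; the paper may well use a different pooling or fusion scheme than the mean-pool-and-concatenate shown here.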


Key Contributions

  • Proposes MAFE (Multimodal Aggregated Feature Extraction) framework enabling CLIP to handle long text and fuse multimodal information
  • Discovers distinct distributional patterns between benign and malicious prompts in MAFE-extracted features
  • Develops VLMShield as a lightweight plug-and-play safety detector for identifying multimodal malicious attacks on VLMs

🛡️ Threat Analysis

Input Manipulation Attack

Defends against adversarial visual inputs to VLMs: the paper addresses malicious prompts that combine visual and textual components designed to manipulate VLM behavior. The multimodal attack vector involves adversarial manipulation of inputs at inference time.
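The paper's finding that benign and malicious prompts form distinct distributions in feature space is what makes a lightweight, inference-time detector viable. As a toy illustration (not the paper's model — the synthetic clusters and the linear probe are assumptions), if the two classes occupy separated regions, even a tiny logistic-regression probe fit by plain gradient descent separates them cheaply:

```python
import numpy as np

rng = np.random.default_rng(42)
dim = 32

# Synthetic stand-ins for MAFE features: two well-separated Gaussian
# clusters, mimicking the reported distributional gap between classes.
benign = rng.normal(loc=-1.0, scale=1.0, size=(200, dim))
malicious = rng.normal(loc=+1.0, scale=1.0, size=(200, dim))
X = np.vstack([benign, malicious])
y = np.concatenate([np.zeros(200), np.ones(200)])

# Tiny linear probe trained with gradient descent on the logistic loss.
w, b = np.zeros(dim), 0.0
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(malicious)
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * (p - y).mean()

pred = (1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5
acc = (pred == y).mean()
print(f"train accuracy: {acc:.2f}")
```

The point of the sketch is the cost profile: once features are extracted, flagging a prompt is a single dot product, which is consistent with the paper's plug-and-play, efficiency-focused framing.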


Details

Domains: multimodal, vision, nlp

Model Types: vlm, multimodal, transformer

Threat Tags: inference_time, black_box

Applications: vision-language models, multimodal ai safety