VLMShield: Efficient and Robust Defense of Vision-Language Models against Malicious Prompts
Peigui Qi 1, Kunsheng Tang 1, Yanpu Yu 1, Jialin Wu 2, Yide Song 3, Wenbo Zhou 1, Zhicong Huang 2, Cheng Hong 2, Weiming Zhang 1, Nenghai Yu 1
Published on arXiv
2604.06502
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Demonstrates superior performance in robustness, efficiency, and utility for detecting multimodal jailbreak attacks on VLMs
VLMShield
Novel technique introduced
Vision-Language Models (VLMs) face significant safety vulnerabilities from malicious prompt attacks due to weakened alignment during visual integration. Existing defenses suffer from limitations in efficiency and robustness. To address these challenges, we first propose the Multimodal Aggregated Feature Extraction (MAFE) framework, which enables CLIP to handle long text and fuse multimodal information into unified representations. Through empirical analysis of MAFE-extracted features, we discover distinct distributional patterns between benign and malicious prompts. Building upon this finding, we develop VLMShield, a lightweight safety detector that efficiently identifies multimodal malicious attacks as a plug-and-play solution. Extensive experiments demonstrate superior performance across multiple dimensions, including robustness, efficiency, and utility. Through our work, we hope to pave the way for more secure multimodal AI deployment. Code is available at [github.com/pgqihere/VLMShield](https://github.com/pgqihere/VLMShield).
Key Contributions
- Proposes MAFE (Multimodal Aggregated Feature Extraction) framework enabling CLIP to handle long text and fuse multimodal information
- Discovers distinct distributional patterns between benign and malicious prompts in MAFE-extracted features
- Develops VLMShield as a lightweight plug-and-play safety detector for identifying multimodal malicious attacks on VLMs
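The summary does not detail MAFE's internals, so the following is a minimal sketch under stated assumptions: long text is split into windows that fit CLIP's 77-token limit, the per-window text embeddings are mean-pooled, and the pooled text vector is concatenated with the image embedding to form the unified representation that a lightweight detector head scores. The function names, the pooling choice, and the single logistic unit are illustrative, not the paper's actual design.

```python
import numpy as np

EMB_DIM = 512  # typical CLIP embedding width (assumption)

def mafe_aggregate(chunk_embs: np.ndarray, image_emb: np.ndarray) -> np.ndarray:
    """Fuse per-chunk text embeddings and an image embedding into one vector.

    chunk_embs: (n_chunks, EMB_DIM) text embeddings, one per 77-token window
    image_emb:  (EMB_DIM,) image embedding
    """
    text_emb = chunk_embs.mean(axis=0)                      # pool long-text windows
    text_emb = text_emb / (np.linalg.norm(text_emb) + 1e-8) # unit-normalise each modality
    image_emb = image_emb / (np.linalg.norm(image_emb) + 1e-8)
    return np.concatenate([text_emb, image_emb])            # unified representation

def score_malicious(feature: np.ndarray, w: np.ndarray, b: float) -> float:
    """Lightweight detector head: a single logistic unit over the fused feature."""
    return 1.0 / (1.0 + np.exp(-(feature @ w + b)))

# Toy usage with random stand-ins for real CLIP embeddings.
rng = np.random.default_rng(42)
chunks = rng.standard_normal((3, EMB_DIM))  # a long prompt split into 3 windows
img = rng.standard_normal(EMB_DIM)
feat = mafe_aggregate(chunks, img)
w = rng.standard_normal(2 * EMB_DIM) * 0.01
prob = score_malicious(feat, w, 0.0)
print(feat.shape)  # (1024,)
```

In practice the detector would be trained on features extracted from labelled benign/malicious prompts; the key property the paper relies on is that those two classes separate in this fused feature space, so even a very small head suffices.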
🛡️ Threat Analysis
Defends against adversarial visual inputs to VLMs: the paper addresses malicious prompts that combine visual and textual components designed to manipulate VLM behavior. The multimodal attack vector involves adversarial manipulation of inputs at inference time.