Defense · 2025

SHIELD: Classifier-Guided Prompting for Robust and Safer LVLMs

Juan Ren, Mark Dras, Usman Naseem

4 citations · 35 references · arXiv

Published on arXiv · 2510.13190

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

SHIELD consistently lowers jailbreak and non-following rates across all five evaluated LVLMs and five benchmarks with negligible computational overhead and no retraining required.

SHIELD

Novel technique introduced


Large Vision-Language Models (LVLMs) unlock powerful multimodal reasoning but also expand the attack surface, particularly through adversarial inputs that conceal harmful goals in benign prompts. We propose SHIELD, a lightweight, model-agnostic preprocessing framework that couples fine-grained safety classification with category-specific guidance and explicit actions (Block, Reframe, Forward). Unlike binary moderators, SHIELD composes tailored safety prompts that enforce nuanced refusals or safe redirection without retraining. Across five benchmarks and five representative LVLMs, SHIELD consistently lowers jailbreak and non-following rates while preserving utility. Our method is plug-and-play, incurs negligible overhead, and is easily extendable to new attack types -- serving as a practical safety patch for both weakly and strongly aligned LVLMs.


Key Contributions

  • Fine-grained harmful-content taxonomy that maps each safety category to explicit 'should do / should not do' policies and one of three explicit actions (Block, Reframe, Forward)
  • Plug-and-play preprocessing framework requiring no LVLM retraining, integrating seamlessly with both weakly and strongly aligned models
  • Comprehensive evaluation across five benchmarks and five LVLMs demonstrating consistent reduction in jailbreak and non-following rates while preserving utility
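The classifier-guided dispatch described above can be sketched as follows. This is a minimal, hypothetical illustration: the category names, policy texts, and keyword-based classifier stub are assumptions for demonstration, not the paper's actual taxonomy or classifier.

```python
# Hypothetical sketch of SHIELD-style classifier-guided preprocessing.
# Categories, guidance strings, and the classifier stub are illustrative
# placeholders, not the paper's exact taxonomy.
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    BLOCK = "block"      # refuse outright
    REFRAME = "reframe"  # answer under category-specific safety guidance
    FORWARD = "forward"  # pass through unchanged


@dataclass
class Policy:
    action: Action
    guidance: str  # category-specific "should do / should not do" instructions


# Each fine-grained safety category maps to an explicit policy and action.
POLICIES = {
    "illegal_activity": Policy(Action.BLOCK, "Refuse; do not provide instructions."),
    "sensitive_topic": Policy(Action.REFRAME, "Answer at a high level; redirect to safe resources."),
    "benign": Policy(Action.FORWARD, ""),
}


def classify(text: str) -> str:
    """Stand-in for the fine-grained safety classifier (keyword heuristic only)."""
    lowered = text.lower()
    if "explosive" in lowered:
        return "illegal_activity"
    if "self-harm" in lowered:
        return "sensitive_topic"
    return "benign"


def shield_preprocess(user_prompt: str) -> tuple[Action, str]:
    """Return the chosen action and the (possibly augmented) prompt for the LVLM."""
    policy = POLICIES[classify(user_prompt)]
    if policy.action is Action.BLOCK:
        return policy.action, "Request blocked by safety policy."
    if policy.action is Action.REFRAME:
        # Compose a tailored safety prompt rather than a blanket refusal.
        return policy.action, f"[Safety guidance: {policy.guidance}]\n{user_prompt}"
    return policy.action, user_prompt
```

Because the framework only rewrites or gates the prompt before it reaches the model, no retraining is needed and the same preprocessing layer can sit in front of any LVLM.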

🛡️ Threat Analysis

Input Manipulation Attack

SHIELD defends against adversarial visual inputs to VLMs (Type I: pixel-level adversarial perturbations on images that jailbreak LVLMs), qualifying for the dual ML01+LLM01 tag per the multimodal adversarial input guideline.


Details

Domains
vision, nlp, multimodal
Model Types
vlm, llm, multimodal
Threat Tags
inference_time, digital, black_box
Datasets
MM-SafetyBench, JailBreakV, XSTest
Applications
large vision-language models, multimodal AI safety, content moderation