defense · arXiv · Mar 9, 2026
Qishun Yang, Shu Yang, Lijie Hu et al. · King Abdullah University of Science and Technology · China University of Petroleum-Beijing · Mohamed bin Zayed University of Artificial Intelligence
Defends VLMs against visual jailbreaks via label-free fine-tuning on neutral threat-image tasks to shape safety-oriented personas
Prompt Injection · vision · multimodal · nlp
Multimodal large language models (MLLMs) suffer from safety misalignment, where visual inputs can elicit harmful outputs. Existing defenses require explicit safety labels or contrastive data. However, threat-related concepts are concrete and visually depictable, whereas safety concepts such as helpfulness are abstract and lack visual referents. Inspired by the self-fulfilling mechanism underlying emergent misalignment, we propose Visual Self-Fulfilling Alignment (VSFA), which fine-tunes vision-language models (VLMs) on neutral VQA tasks constructed around threat-related images, without any safety labels. Through repeated exposure to threat-related visual content, models internalize the implicit semantics of vigilance and caution, shaping safety-oriented personas. Experiments across multiple VLMs and safety benchmarks demonstrate that VSFA reduces attack success rates, improves response quality, and mitigates over-refusal while preserving general capabilities. Our work extends the self-fulfilling mechanism from text to the visual modality, offering a label-free approach to VLM alignment.
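The abstract only sketches the training recipe, so below is a minimal, hedged illustration of the idea as stated: pair threat-related images with neutral VQA supervision and fine-tune with an ordinary language-modeling loss, with no safety label anywhere in the pipeline. Everything here (the question templates, `build_vsfa_examples`, the `captioner` callback, and the `TinyVLM` stand-in) is a hypothetical sketch, not the authors' code.

```python
import torch
import torch.nn as nn

# Hypothetical neutral question templates: they ask about image content
# and never mention safety, harm, or refusal.
NEUTRAL_QUESTIONS = [
    "What objects are visible in this image?",
    "Describe the scene in one sentence.",
    "What is the main subject of this image?",
]

def build_vsfa_examples(threat_images, captioner):
    """Pair each threat-related image with neutral VQA tasks.

    `captioner` is any image -> neutral-answer function (hypothetical);
    crucially, no harmful/safe label is ever attached.
    """
    return [{"image": img, "question": q, "answer": captioner(img)}
            for img in threat_images for q in NEUTRAL_QUESTIONS]

class TinyVLM(nn.Module):
    """Toy stand-in for a real vision-language model (no attention)."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.vision = nn.Linear(3 * 32 * 32, dim)  # toy image encoder
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, image, tokens):
        v = self.vision(image.flatten(1)).unsqueeze(1)  # (B, 1, dim) image "token"
        h = torch.cat([v, self.embed(tokens)], dim=1)   # (B, 1 + T, dim)
        # Each position predicts the next token (teacher forcing);
        # drop the last position so logits align with the T targets.
        return self.head(h)[:, :-1, :]                  # (B, T, vocab)

def train_step(model, opt, image, tokens):
    """One ordinary supervised fine-tuning step: cross-entropy on the
    neutral answer tokens. Per the paper's claim, safety behavior is
    meant to emerge from repeated exposure to threat imagery, not from
    any safety term in the loss."""
    logits = model(image, tokens)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tokens.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Smoke test on random stand-in data.
model = TinyVLM()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
images = torch.randn(2, 3, 32, 32)       # stand-in "threat-related" images
tokens = torch.randint(0, 1000, (2, 8))  # tokenized neutral question + answer
print(train_step(model, opt, images, tokens))
```

In a real run, `TinyVLM` would be an actual VLM and `tokens` the tokenized question–answer pair; the point of the sketch is that the objective is plain next-token prediction on neutral content, so the safety signal comes entirely from what the images depict.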
vlm · llm · multimodal