defense 2026

Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images

Qishun Yang 1,2, Shu Yang 1, Lijie Hu 3, Di Wang 1


Published on arXiv

arXiv:2603.08486

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

VSFA reduces attack success rate and mitigates over-refusal in VLMs without any safety labels by fine-tuning on neutral VQA tasks constructed around threat-related images.

VSFA (Visual Self-Fulfilling Alignment)

Novel technique introduced


Multimodal large language models (MLLMs) face safety misalignment, where visual inputs can elicit harmful outputs. Existing methods address this with explicit safety labels or contrastive data; however, threat-related concepts are concrete and visually depictable, whereas safety concepts, such as helpfulness, are abstract and lack visual referents. Inspired by the self-fulfilling mechanism underlying emergent misalignment, we propose Visual Self-Fulfilling Alignment (VSFA). VSFA fine-tunes vision-language models (VLMs) on neutral VQA tasks constructed around threat-related images, without any safety labels. Through repeated exposure to threat-related visual content, models internalize the implicit semantics of vigilance and caution, shaping safety-oriented personas. Experiments across multiple VLMs and safety benchmarks demonstrate that VSFA reduces attack success rate, improves response quality, and mitigates over-refusal while preserving general capabilities. Our work extends the self-fulfilling mechanism from text to visual modalities, offering a label-free approach to VLM alignment.


Key Contributions

  • VSFA: a label-free VLM safety alignment method that fine-tunes on neutral VQA tasks built around threat-related images, requiring no explicit safety labels or contrastive data
  • Extends the self-fulfilling alignment mechanism from text to the visual modality, enabling safety-oriented persona formation via implicit visual semantics
  • Empirically demonstrates reduced attack success rate, improved response quality, and mitigated over-refusal across multiple VLMs and safety benchmarks while preserving general capabilities
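To make the label-free data construction described above concrete, the sketch below pairs threat-related images with purely neutral VQA prompts, so that no refusal text or safety label ever appears in the training targets. This is a minimal illustration under stated assumptions: the file paths, question templates, and the `build_vsfa_samples` helper are hypothetical and not taken from the paper.

```python
# Hypothetical sketch of VSFA-style data construction: pair threat-related
# images with *neutral* VQA questions and plain factual answers. The safety
# signal is implicit in the images, not in any label or refusal text.

# Illustrative neutral question templates (assumed, not from the paper).
NEUTRAL_QUESTION_TEMPLATES = [
    "What objects are visible in this image?",
    "Describe the scene in one sentence.",
    "What colors dominate the image?",
]

def build_vsfa_samples(threat_image_paths, captions):
    """Build (image, question, answer) fine-tuning triples.

    `captions` maps an image path to a list of plain factual answers,
    one per question template. Note that nothing here encodes a safety
    label: the answers are ordinary descriptions of the image content.
    """
    samples = []
    for path in threat_image_paths:
        for i, question in enumerate(NEUTRAL_QUESTION_TEMPLATES):
            samples.append({
                "image": path,      # threat-related image (the implicit signal)
                "question": question,  # neutral VQA prompt
                "answer": captions[path][i],  # factual, label-free target
            })
    return samples
```

Each resulting triple would then be used as an ordinary supervised VQA example; per the paper's claim, repeated exposure to the threat-related imagery, rather than any explicit safety annotation, is what shapes the safety-oriented persona.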

🛡️ Threat Analysis


Details

Domains
vision, multimodal, nlp
Model Types
vlm, llm, multimodal
Threat Tags
inference_time, black_box
Applications
vision-language models, multimodal AI safety alignment