defense arXiv Mar 9, 2026 · 28d ago
Qishun Yang, Shu Yang, Lijie Hu et al. · King Abdullah University of Science and Technology · China University of Petroleum-Beijing · Mohamed bin Zayed University of Artificial Intelligence
Defends VLMs against visual jailbreaks via label-free fine-tuning on neutral threat-image tasks to shape safety-oriented personas
Prompt Injection vision multimodal nlp
Multimodal large language models (MLLMs) suffer from safety misalignment, where visual inputs can elicit harmful outputs. Existing defenses address this by relying on explicit safety labels or contrastive data; yet threat-related concepts are concrete and visually depictable, while safety concepts, such as helpfulness, are abstract and lack visual referents. Inspired by the self-fulfilling mechanism underlying emergent misalignment, we propose Visual Self-Fulfilling Alignment (VSFA). VSFA fine-tunes vision-language models (VLMs) on neutral VQA tasks constructed around threat-related images, without any safety labels. Through repeated exposure to threat-related visual content, models internalize the implicit semantics of vigilance and caution, shaping safety-oriented personas. Experiments across multiple VLMs and safety benchmarks demonstrate that VSFA reduces attack success rates, improves response quality, and mitigates over-refusal while preserving general capabilities. Our work extends the self-fulfilling mechanism from the text to the visual modality, offering a label-free approach to VLM alignment.
vlm llm multimodal King Abdullah University of Science and Technology · China University of Petroleum-Beijing · Mohamed bin Zayed University of Artificial Intelligence
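The recipe is concrete enough to sketch. Below is a minimal Python sketch of the two steps the abstract describes: build neutral VQA pairs over threat-related images, then fine-tune with the ordinary next-token loss, with no safety labels anywhere. The model choice (`llava-hf/llava-1.5-7b-hf`), the question templates, and using the base model's own captions as answer targets are illustrative assumptions, not the paper's specifics.

```python
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # stand-in; VSFA is evaluated on multiple VLMs
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

# Deliberately neutral, safety-agnostic questions (hypothetical templates).
NEUTRAL_QUESTIONS = [
    "What objects are visible in this image?",
    "Describe the scene in one sentence.",
]

def build_vqa_pair(image, question):
    """Make a label-free (prompt, answer) pair: the base model's own
    greedy answer serves as the target (an assumption of this sketch)."""
    prompt = f"USER: <image>\n{question} ASSISTANT:"
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    with torch.no_grad():
        out_ids = model.generate(**inputs, max_new_tokens=40)
    # generate() returns prompt + continuation; keep only the new tokens.
    answer_ids = out_ids[:, inputs["input_ids"].shape[1]:]
    answer = processor.batch_decode(answer_ids, skip_special_tokens=True)[0]
    return prompt, answer

def finetune_step(image, prompt, answer, optimizer):
    """One ordinary supervised step. Nothing here is safety-specific;
    repeated exposure to threat imagery is what shapes the persona."""
    inputs = processor(images=image, text=prompt + " " + answer,
                       return_tensors="pt")
    # Plain language-modeling loss; a real run would mask the prompt tokens.
    out = model(**inputs, labels=inputs["input_ids"])
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```

The point of the sketch is what is absent: no refusal targets, no contrastive pairs, no safety classifier. Only the choice of threat-related images carries the alignment signal.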
defense arXiv Mar 19, 2026 · 18d ago
Zikang Ding, Junhao Li, Suling Wu et al. · University of Electronic Science and Technology of China · Mohamed bin Zayed University of Artificial Intelligence · South China University of Technology
Embeds ownership watermarks in a low-dimensional functional subspace of LLM weights so they survive fine-tuning, quantization, and distillation attacks
Model Theft nlp
Model watermarking uses internal representations to protect the ownership of large language models (LLMs). However, these representations inevitably undergo complex distortions under realistic model modifications such as fine-tuning, quantization, or knowledge distillation, making reliable watermark extraction extremely challenging. Despite extensive research on model-side watermarking, existing methods still lack robustness to parameter-level perturbations. To address this gap, we propose Functional Subspace Watermarking (FSW), a framework that anchors ownership signals in a low-dimensional functional subspace. Specifically, we solve a generalized eigenvalue problem to extract a stable functional subspace for watermark injection, and introduce an adaptive spectral truncation strategy to balance robustness against model utility. A vector consistency constraint further ensures that watermark injection does not degrade the model's original semantic performance. Extensive experiments across LLM architectures and datasets demonstrate that FSW achieves superior detection accuracy and statistical verifiability under multiple model-level attacks, outperforming existing state-of-the-art (SOTA) methods in robustness.
llm transformer University of Electronic Science and Technology of China · Mohamed bin Zayed University of Artificial Intelligence · South China University of Technology
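The abstract's pipeline (generalized eigenproblem, adaptive spectral truncation, subspace injection) can be sketched with NumPy/SciPy. The matrices A and B (a function-sensitivity matrix and a perturbation covariance), the energy threshold, and the cosine-correlation detector below are illustrative assumptions; the paper's exact operators, truncation rule, consistency constraint, and verification statistic may differ.

```python
import numpy as np
from scipy.linalg import eigh

def functional_subspace(A, B, energy=0.95):
    """Solve the generalized eigenproblem A v = lambda B v (A symmetric PSD,
    B symmetric positive definite, both assumed) and keep the leading
    directions covering `energy` of the spectrum, a simple stand-in for
    adaptive spectral truncation."""
    evals, evecs = eigh(A, B)                  # eigenvalues in ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]
    k = int(np.searchsorted(np.cumsum(evals) / evals.sum(), energy)) + 1
    return evecs[:, :k]                        # columns span the stable subspace

def inject_watermark(W, U, key, strength=1e-3):
    """Add a key-derived signal to weight matrix W inside span(U) only."""
    rng = np.random.default_rng(key)
    s = rng.standard_normal(U.shape[1])        # secret subspace coefficients
    return W + strength * (U @ s)[:, None]     # same offset in every column

def detect_watermark(W, U, key):
    """Score ownership by cosine correlation between W's subspace projection
    and the secret signal (treats U as roughly orthonormal for simplicity)."""
    rng = np.random.default_rng(key)
    s = rng.standard_normal(U.shape[1])
    proj = U.T @ W.mean(axis=1)                # collapse columns, then project
    return float(proj @ s / (np.linalg.norm(proj) * np.linalg.norm(s) + 1e-12))
```

The intuition the sketch captures: because the eigen-basis is chosen where the model's function is stable, perturbations from fine-tuning or quantization mostly fall outside span(U), so the detector's score on a watermarked model degrades slowly while an unmarked model scores near zero.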