defense arXiv Feb 27, 2026 · 5w ago
Xingyu Zhu, Beier Zhu, Junfeng Fang et al. · University of Science and Technology of China · Nanyang Technological University +2 more
Training-free defense for VLMs uses optimal transport patch detection and attention calibration to block visual jailbreaks
Input Manipulation Attack · Prompt Injection · vision · nlp · multimodal
Large vision-language models (LVLMs) have achieved remarkable progress in vision-language reasoning tasks, yet ensuring their safety remains a critical challenge. Recent input-side defenses detect unsafe images with CLIP and prepend safety prefixes to prompts, but they still suffer from inaccurate detection in complex scenes and unstable safety signals during decoding. To address these issues, we propose GuardAlign, a training-free defense framework that integrates two strategies. First, OT-enhanced safety detection leverages optimal transport to measure distribution distances between image patches and unsafe semantics, enabling accurate identification of malicious regions without additional computational cost. Second, cross-modal attentive calibration strengthens the influence of safety prefixes by adaptively reallocating attention across layers, ensuring that safety signals remain consistently activated throughout generation. Extensive evaluations on six representative MLLMs demonstrate that GuardAlign reduces unsafe response rates by up to 39% on SPA-VL while preserving utility, with VQAv2 performance improving from 78.51% to 79.21%.
vlm llm multimodal University of Science and Technology of China · Nanyang Technological University · National University of Singapore +1 more
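The optimal-transport detection step described in the GuardAlign abstract can be illustrated with a generic entropic-OT (Sinkhorn) distance between a set of patch embeddings and a set of unsafe-concept embeddings. This is a minimal sketch and not the authors' implementation: the function name `sinkhorn_distance`, the cosine cost, the uniform marginals, the regularization `eps`, and the toy 3-dimensional embeddings are all assumptions made for illustration.

```python
import numpy as np

def sinkhorn_distance(patches, unsafe, eps=0.1, n_iters=200):
    """Entropic OT cost between two embedding sets (hypothetical helper)."""
    # L2-normalize so the cost below is a cosine distance in [0, 2].
    p = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    u = unsafe / np.linalg.norm(unsafe, axis=1, keepdims=True)
    cost = 1.0 - p @ u.T                      # shape (n_patches, n_unsafe)

    n, m = cost.shape
    a = np.full(n, 1.0 / n)                   # assumed uniform mass on patches
    b = np.full(m, 1.0 / m)                   # assumed uniform mass on concepts

    # Sinkhorn iterations: alternately rescale rows and columns of the kernel.
    kernel = np.exp(-cost / eps)
    col_scale = np.ones(m)
    for _ in range(n_iters):
        row_scale = a / (kernel @ col_scale)
        col_scale = b / (kernel.T @ row_scale)

    plan = row_scale[:, None] * kernel * col_scale[None, :]
    return float((plan * cost).sum()), plan

# Toy embeddings (made up): two "unsafe" directions, and two patch sets.
unsafe_emb = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
close_patches = np.array([[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [1.0, 0.0, 0.0]])
far_patches = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0], [0.0, 0.1, 1.0]])

d_close, plan_close = sinkhorn_distance(close_patches, unsafe_emb)
d_far, _ = sinkhorn_distance(far_patches, unsafe_emb)
```

In this toy setup a smaller distance means the patch distribution sits closer to the unsafe concepts, so a detector could flag an image when the distance falls below a threshold. How GuardAlign actually scores or thresholds patches is not stated in the abstract.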
defense arXiv Feb 12, 2026 · 7w ago
Zhaoxin Wang, Jiaming Liang, Fengbin Zhu et al. · Xidian University · National University of Singapore +1 more
Defends LLM safety alignment against neuron pruning attacks by redistributing safety representations across the network via selective neuron freezing
Prompt Injection · nlp · multimodal
Large language models (LLMs) and multimodal LLMs are typically safety-aligned before release to prevent harmful content generation. However, recent studies show that safety behaviors are concentrated in a small subset of parameters, making alignment brittle and easily bypassed through neuron-level attacks. Moreover, most existing alignment methods operate at the behavioral level, offering limited control over the model's internal safety mechanisms. In this work, we propose SafeNeuron, a neuron-level safety alignment framework that improves robustness by redistributing safety representations across the network. SafeNeuron first identifies safety-related neurons, then freezes these neurons during preference optimization to prevent reliance on sparse safety pathways and force the model to construct redundant safety representations. Extensive experiments across models and modalities demonstrate that SafeNeuron significantly improves robustness against neuron pruning attacks, reduces the risk of open-source models being repurposed as red-team generators, and preserves general capabilities. Furthermore, our layer-wise analysis reveals that safety behaviors are governed by stable and shared internal representations. Overall, SafeNeuron provides an interpretable and robust perspective for model alignment.
llm vlm transformer Xidian University · National University of Singapore · Harbin Institute of Technology
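The freezing step described in the SafeNeuron abstract (keep identified safety neurons fixed while the rest of the network is updated) can be sketched as a masked gradient step. This is an illustrative sketch under stated assumptions, not the paper's code: the helper `masked_sgd_step`, the row-per-neuron weight layout, plain SGD, and the hard-coded `frozen_rows` are hypothetical, and the gradient here is random stand-in data rather than a real preference-optimization gradient.

```python
import numpy as np

def masked_sgd_step(weights, grads, frozen_rows, lr=0.01):
    """Apply one SGD step, skipping rows (neurons) listed in frozen_rows."""
    trainable = np.ones(weights.shape[0], dtype=bool)
    trainable[np.asarray(frozen_rows, dtype=int)] = False

    updated = weights.copy()
    # Only trainable neurons receive the update; frozen ones keep their values.
    updated[trainable] -= lr * grads[trainable]
    return updated

rng = np.random.default_rng(0)
weights = rng.normal(size=(5, 4))     # 5 neurons, 4 inputs each (toy layer)
grads = rng.normal(size=(5, 4))       # stand-in for a training gradient
frozen_rows = [1, 3]                  # assumed indices of "safety neurons"

new_weights = masked_sgd_step(weights, grads, frozen_rows)
```

Because the frozen rows cannot change, any reduction in the training loss has to come from the remaining neurons, which is the intuition the abstract gives for pushing the model toward redundant safety representations.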