Precise Shield: Explaining and Aligning VLLM Safety via Neuron-Level Guidance

Enyi Shi 1,2, Fei Shen 1,2, Shuyi Miao 1,3, Linxia Zhu 1,2, Pengyang Shao 4, Jinhui Tang 1, Tat-Seng Chua 2

Published on arXiv

2604.08881

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Substantially improves VLLM safety against multilingual/multimodal attacks while preserving generalization by updating fewer than 0.03% of parameters

Precise Shield

Novel technique introduced


In real-world deployments, Vision-Language Large Models (VLLMs) face critical challenges from multilingual and multimodal composite attacks: harmful images paired with low-resource language texts can easily bypass defenses designed for high-resource language scenarios, exposing structural blind spots in current cross-lingual and cross-modal safety methods. This raises a mechanistic question: where is safety capability instantiated within the model, and how is it distributed across languages and modalities? Prior studies on pure-text LLMs have identified cross-lingual shared safety neurons, suggesting that safety may be governed by a small subset of critical neurons. Leveraging this insight, we propose Precise Shield, a two-stage framework that first identifies safety neurons by contrasting activation patterns between harmful and benign inputs, and then constrains parameter updates strictly within this subspace via gradient masking, affecting fewer than 0.03% of parameters. This strategy substantially improves safety while preserving multilingual and multimodal generalization. Further analysis reveals a moderate overlap of safety neurons across languages and modalities, enabling zero-shot cross-lingual and cross-modal transfer of safety capabilities, and offering a new direction for neuron-level, transfer-based safety enhancement.


Key Contributions

  • Two-stage framework identifying safety neurons via activation pattern analysis between harmful/benign inputs
  • Gradient masking technique constraining fine-tuning to <0.03% of parameters (safety neuron subspace only)
  • Demonstrates moderate cross-lingual and cross-modal safety neuron overlap enabling zero-shot transfer
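The gradient-masking contribution above can be sketched as a single optimizer step in which gradients outside the safety-neuron subspace are zeroed before the update. This is a hedged toy example over a flat parameter vector, not the paper's training code; the function name and learning rate are illustrative.

```python
import numpy as np

def masked_sgd_step(params, grads, safety_idx, lr=0.5):
    """One SGD step restricted to the safety-neuron subspace:
    gradients outside `safety_idx` are zeroed, so all other
    parameters stay frozen."""
    mask = np.zeros_like(grads)
    mask[safety_idx] = 1.0          # 1 on safety neurons, 0 elsewhere
    return params - lr * grads * mask

params = np.ones(8)
grads = np.full(8, 1.0)
updated = masked_sgd_step(params, grads, safety_idx=[1, 5])
print(updated)  # only positions 1 and 5 move: 1 - 0.5*1 = 0.5
```

In a real framework the same effect is typically achieved with per-parameter gradient hooks, so the masking composes with any optimizer rather than requiring a custom update rule.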

🛡️ Threat Analysis

Input Manipulation Attack

Paper addresses adversarial multimodal inputs (harmful images + text) designed to bypass VLLM safety mechanisms at inference time — these are composite adversarial attacks manipulating model behavior.


Details

Domains
multimodal, nlp, vision
Model Types
vlm, multimodal, transformer
Threat Tags
inference_time, targeted
Applications
vision-language models, multilingual safety, multimodal content moderation