defense 2026

Safety-Potential Pruning for Enhancing Safety Prompts Against VLM Jailbreaking Without Retraining

Chongxin Li , Hanzhang Wang , Lian Duan


Published on arXiv: 2603.14219

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Reduces jailbreak attack success rates by up to 22% relative to safety prompting alone while maintaining benign performance across three VLM architectures

Safety-Potential Pruning

Novel technique introduced


Safety prompts constitute an interpretable layer of defense against jailbreak attacks in vision-language models (VLMs); however, their efficacy is constrained by the models' latent structural responsiveness. We observe that such prompts consistently engage a sparse set of parameters that remain largely quiescent during benign use. This finding motivates the Safety Subnetwork Hypothesis: VLMs embed structurally distinct pathways capable of enforcing safety, but these pathways remain dormant without explicit stimulation. To expose and amplify these pathways, we introduce Safety-Potential Pruning, a one-shot pruning framework that amplifies safety-relevant activations by removing weights that are less responsive to safety prompts, without additional retraining. Across three representative VLM architectures and three jailbreak benchmarks, our method reduces attack success rates by up to 22% relative to prompting alone, all while maintaining strong benign performance. These findings frame pruning not only as a model compression technique, but as a structural intervention that surfaces alignment-relevant subnetworks, offering a new path to robust jailbreak resistance.


Key Contributions

  • Safety Subnetwork Hypothesis: VLMs contain dormant safety-enforcing pathways activated by safety prompts
  • Safety-Potential Pruning: one-shot pruning method that removes weights less responsive to safety prompts
  • Achieves up to 22% reduction in attack success rates across three VLM architectures without retraining
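The core idea above (score each weight by how much more it responds to safety-prompted inputs than to benign ones, then prune the least responsive weights in one shot) can be sketched as follows. The function name, the magnitude-times-activation saliency (a Wanda-style heuristic), and the thresholding rule are all assumptions for illustration; the paper's exact scoring function is not reproduced here.

```python
import numpy as np

def safety_potential_prune(W, x_safety, x_benign, sparsity=0.25):
    """Hedged sketch of one-shot safety-potential pruning for a linear layer.

    W        : (out, in) weight matrix.
    x_safety : (n, in) input activations collected under safety prompts.
    x_benign : (n, in) input activations collected under benign prompts.

    Saliency here is |w| * (mean |safety activation| - mean |benign activation|),
    so weights driven mostly by benign traffic score low and are removed.
    This specific score is an assumption, not the paper's formula.
    """
    a_safe = np.abs(x_safety).mean(axis=0)   # per-input-dim mean |activation|
    a_ben = np.abs(x_benign).mean(axis=0)
    score = np.abs(W) * (a_safe - a_ben)     # (out, in): higher = more safety-responsive
    k = int(sparsity * W.size)               # number of weights to remove
    if k == 0:
        return W.copy()
    thresh = np.partition(score.ravel(), k - 1)[k - 1]
    mask = score > thresh                    # drop the k least safety-responsive weights
    return W * mask

# Toy usage: safety prompts excite later input dimensions more strongly.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
x_safe = rng.normal(size=(16, 8)) * np.linspace(1.0, 2.0, 8)
x_ben = rng.normal(size=(16, 8))
W_pruned = safety_potential_prune(W, x_safe, x_ben, sparsity=0.25)
```

Because the method is one-shot, no gradient updates or retraining are needed: the pruned layer is used as-is, which matches the paper's "without retraining" claim.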

🛡️ Threat Analysis


Details

Domains
multimodal, vision, nlp
Model Types
vlm, multimodal, transformer
Threat Tags
inference_time
Applications
vision-language models, multimodal AI safety