Safety-Potential Pruning for Enhancing Safety Prompts Against VLM Jailbreaking Without Retraining
Chongxin Li, Hanzhang Wang, Lian Duan
Published on arXiv
2603.14219
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Reduces jailbreak attack success rates by up to 22% relative to safety prompting alone while maintaining benign performance across three VLM architectures
Safety-Potential Pruning
Novel technique introduced
Safety prompts constitute an interpretable layer of defense against jailbreak attacks in vision-language models (VLMs); however, their efficacy is constrained by the models' latent structural responsiveness. We observe that such prompts consistently engage a sparse set of parameters that remain largely quiescent during benign use. This finding motivates the Safety Subnetwork Hypothesis: VLMs embed structurally distinct pathways capable of enforcing safety, but these pathways remain dormant without explicit stimulation. To expose and amplify these pathways, we introduce Safety-Potential Pruning, a one-shot pruning framework that amplifies safety-relevant activations by removing weights that are less responsive to safety prompts, without any additional retraining. Across three representative VLM architectures and three jailbreak benchmarks, our method reduces attack success rates by up to 22% relative to prompting alone, all while maintaining strong benign performance. These findings frame pruning not only as a model compression technique, but also as a structural intervention that surfaces alignment-relevant subnetworks, offering a new path to robust jailbreak resistance.
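To make the idea concrete, the abstract's one-shot procedure, scoring each weight by how much more it responds under safety-prompted inputs than under benign ones, then removing the least safety-responsive fraction, can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the magnitude-times-activation saliency rule, the function name, and the per-layer interface are all assumptions.

```python
import numpy as np

def safety_potential_prune(W, act_safety, act_benign, ratio=0.1):
    """Illustrative one-shot "safety-potential" pruning of one linear layer.

    W           : (out, in) weight matrix
    act_safety  : (in,) mean input activations recorded under safety prompts
    act_benign  : (in,) mean input activations recorded under benign prompts
    ratio       : fraction of weights to remove

    Weights whose saliency under safety prompts exceeds their benign
    saliency the least are zeroed, amplifying the relative contribution
    of safety-responsive weights. No gradient steps or retraining occur.
    """
    # Per-weight saliency: |w_ij| * |mean activation of input j|
    sal_safety = np.abs(W) * np.abs(act_safety)[None, :]
    sal_benign = np.abs(W) * np.abs(act_benign)[None, :]

    # "Safety potential": how much more a weight responds to safety prompts
    potential = sal_safety - sal_benign

    # One-shot removal of the `ratio` fraction with the lowest potential
    k = int(ratio * W.size)
    threshold = np.partition(potential.ravel(), k)[k]
    mask = potential >= threshold
    return W * mask, mask
```

In a real VLM this would be applied layer by layer, with activations averaged over a calibration set of safety-prompted versus benign inputs rather than single vectors.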
Key Contributions
- Safety Subnetwork Hypothesis: VLMs contain dormant safety-enforcing pathways activated by safety prompts
- Safety-Potential Pruning: one-shot pruning method that removes weights less responsive to safety prompts
- Achieves up to 22% reduction in attack success rates across three VLM architectures without retraining