Safety-Potential Pruning for Enhancing Safety Prompts Against VLM Jailbreaking Without Retraining
Chongxin Li, Hanzhang Wang, Lian Duan
Published on arXiv
2603.14219
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Reduces jailbreak attack success rates by up to 22% relative to safety prompting alone while maintaining benign performance across three VLM architectures
Safety-Potential Pruning
Novel technique introduced
Safety prompts constitute an interpretable layer of defense against jailbreak attacks in vision-language models (VLMs); however, their efficacy is constrained by the models' latent structural responsiveness. We observe that such prompts consistently engage a sparse set of parameters that remain largely quiescent during benign use. This finding motivates the Safety Subnetwork Hypothesis: VLMs embed structurally distinct pathways capable of enforcing safety, but these pathways remain dormant without explicit stimulation. To expose and amplify these pathways, we introduce Safety-Potential Pruning, a one-shot pruning framework that amplifies safety-relevant activations by removing weights that are less responsive to safety prompts, without any additional retraining. Across three representative VLM architectures and three jailbreak benchmarks, our method reduces attack success rates by up to 22% relative to prompting alone, all while maintaining strong benign performance. These findings frame pruning not only as a model compression technique, but also as a structural intervention that surfaces alignment-relevant subnetworks, offering a new path to robust jailbreak resistance.
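To make the idea concrete, the abstract's one-shot procedure, scoring each weight by how much more it responds under safety-prompted inputs than under benign ones, then removing the least safety-responsive fraction, can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the magnitude-times-activation saliency rule, the function name, and the per-layer interface are all assumptions.

```python
import numpy as np

def safety_potential_prune(W, act_safety, act_benign, ratio=0.1):
    """Illustrative one-shot "safety-potential" pruning of one linear layer.

    W           : (out, in) weight matrix
    act_safety  : (in,) mean input activations recorded under safety prompts
    act_benign  : (in,) mean input activations recorded under benign prompts
    ratio       : fraction of weights to remove

    Weights whose saliency under safety prompts exceeds their benign
    saliency the least are zeroed, amplifying the relative contribution
    of safety-responsive weights. No gradient steps or retraining occur.
    """
    # Per-weight saliency: |w_ij| * |mean activation of input j|
    sal_safety = np.abs(W) * np.abs(act_safety)[None, :]
    sal_benign = np.abs(W) * np.abs(act_benign)[None, :]

    # "Safety potential": how much more a weight responds to safety prompts
    potential = sal_safety - sal_benign

    # One-shot removal of the `ratio` fraction with the lowest potential
    k = int(ratio * W.size)
    threshold = np.partition(potential.ravel(), k)[k]
    mask = potential >= threshold
    return W * mask, mask
```

In a real VLM this would be applied layer by layer, with activations averaged over a calibration set of safety-prompted versus benign inputs rather than single vectors.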
Key Contributions
- Safety Subnetwork Hypothesis: VLMs contain dormant safety-enforcing pathways activated by safety prompts
- Safety-Potential Pruning: one-shot pruning method that removes weights less responsive to safety prompts
- Achieves up to 22% reduction in attack success rates across three VLM architectures without retraining