SafeSteer: Adaptive Subspace Steering for Efficient Jailbreak Defense in Vision-Language Models
Xiyu Zeng 1, Siyuan Liang 2, Liming Lu 1, Haotian Zhu 1, Enguang Liu 1, Jisheng Dang 3, Yongbin Zhou 1, Shuchao Pang 1
Published on arXiv
2509.21400
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
SafeSteer reduces attack success rate by over 60% across diverse jailbreak attacks while improving benign task accuracy by 1–2% and introducing negligible inference latency.
SafeSteer
Novel technique introduced
As the capabilities of Vision-Language Models (VLMs) continue to improve, they are increasingly targeted by jailbreak attacks. Existing defense methods face two major limitations: (1) they struggle to ensure safety without compromising the model's utility; and (2) many defense mechanisms significantly reduce the model's inference efficiency. To address these challenges, we propose SafeSteer, a lightweight, inference-time steering framework that effectively defends against diverse jailbreak attacks without modifying model weights. At the core of SafeSteer is the innovative use of Singular Value Decomposition to construct a low-dimensional "safety subspace." By projecting and reconstructing the raw steering vector into this subspace during inference, SafeSteer adaptively removes harmful generation signals while preserving the model's ability to handle benign inputs. The entire process is executed in a single inference pass, introducing negligible overhead. Extensive experiments show that SafeSteer reduces the attack success rate by over 60% and improves accuracy on normal tasks by 1–2%, without introducing significant inference latency. These results demonstrate that robust and practical jailbreak defense can be achieved through simple, efficient inference-time control.
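The abstract gives no pseudocode, so here is a minimal sketch of the SVD step it describes. The function names, the rank `k`, the paired harmful/benign activation shapes, and the projection form are all my assumptions for illustration, not details from the paper:

```python
import numpy as np

def build_safety_subspace(h_harmful, h_benign, k=8):
    """Top-k right singular vectors of activation differences.

    h_harmful, h_benign: (n_pairs, d_model) activations collected at one
    layer for matched harmful/benign prompts; k is an illustrative rank.
    """
    D = h_harmful - h_benign                          # difference vectors
    _, _, Vt = np.linalg.svd(D, full_matrices=False)  # D = U @ S @ Vt
    return Vt[:k]                                     # (k, d_model) orthonormal basis

def project_and_reconstruct(v_raw, basis):
    # Keep only the components of the raw steering vector that lie
    # inside the safety subspace; everything orthogonal is discarded.
    return basis.T @ (basis @ v_raw)

# Toy demo on random stand-ins for real activations.
rng = np.random.default_rng(0)
basis = build_safety_subspace(rng.normal(size=(64, 256)),
                              rng.normal(size=(64, 256)), k=8)
v_raw = rng.normal(size=256)
v_safe = project_and_reconstruct(v_raw, basis)
```

Because the basis rows are orthonormal, the projection is idempotent and can only shrink the steering vector's norm, which matches the intuition of "removing" signal rather than adding new directions.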
Key Contributions
- Proposes SafeSteer, a lightweight inference-time activation steering framework that uses SVD to construct a low-dimensional "safety subspace" from activation difference vectors
- Projects and reconstructs steering vectors into the safety subspace during inference to adaptively remove harmful generation signals without modifying model weights
- Achieves over 60% reduction in attack success rate with 1–2% improvement in benign task accuracy and negligible latency overhead in a single inference pass
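The second contribution (project, reconstruct, and shift hidden states in one pass) can be sketched with plain array operations. The layer choice, the steering strength `alpha`, the sign of the shift, and all shapes below are my assumptions for illustration, not values from the paper:

```python
import numpy as np

def steer_hidden_states(hidden, v_raw, basis, alpha=1.0):
    """Shift hidden states by the safety-subspace reconstruction of v_raw.

    hidden: (n_tokens, d_model) activations at one chosen layer
    basis:  (k, d_model) orthonormal rows spanning the safety subspace
    alpha:  illustrative steering strength (assumed, not from the paper)
    """
    v_safe = basis.T @ (basis @ v_raw)  # project and reconstruct
    return hidden + alpha * v_safe      # one shift, broadcast over tokens

# Toy demo with random numbers standing in for real activations.
rng = np.random.default_rng(1)
basis = np.linalg.qr(rng.normal(size=(256, 8)))[0].T  # (8, 256) orthonormal rows
hidden = rng.normal(size=(5, 256))
v_raw = rng.normal(size=256)
steered = steer_hidden_states(hidden, v_raw, basis, alpha=0.5)
```

Since the shift is a single vector addition per layer, it adds only one projection and one broadcasted add to the forward pass, consistent with the claimed negligible latency.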
🛡️ Threat Analysis
The paper defends against adversarial visual perturbations and structured image-based attacks that jailbreak VLMs. Because these are adversarial inputs that manipulate model outputs at inference time through the visual modality, they trigger the dual ML01 + LLM01 tagging rule for adversarial visual inputs to VLMs.