Understanding and Defending VLM Jailbreaks via Jailbreak-Related Representation Shift
Zhihua Wei 1, Qiang Li 1, Jian Ruan 1, Zhenxin Qin 1, Leilei Wen 1, Dongrui Liu 2, Wen Shen 1
Published on arXiv
2603.17372
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
JRS-Rem provides a strong defense across multiple VLM jailbreak scenarios while maintaining performance on benign tasks; on the HADES dataset, adding a blank image raises LLaVA-1.5-7B's jailbreak success rate by 28.13%
JRS-Rem
Novel technique introduced
Large vision-language models (VLMs) often exhibit weakened safety alignment with the integration of the visual modality. Even when text prompts contain explicit harmful intent, adding an image can substantially increase jailbreak success rates. In this paper, we observe that VLMs can clearly distinguish benign inputs from harmful ones in their representation space. Moreover, even among harmful inputs, jailbreak samples form a distinct internal state that is separable from refusal samples. These observations suggest that jailbreaks do not arise from a failure to recognize harmful intent. Instead, the visual modality shifts representations toward a specific jailbreak state, thereby leading to a failure to trigger refusal. To quantify this transition, we identify a jailbreak direction and define the jailbreak-related shift as the component of the image-induced representation shift along this direction. Our analysis shows that the jailbreak-related shift reliably characterizes jailbreak behavior, providing a unified explanation for diverse jailbreak scenarios. Finally, we propose a defense method that enhances VLM safety by removing the jailbreak-related shift (JRS-Rem) at inference time. Experiments show that JRS-Rem provides strong defense across multiple scenarios while preserving performance on benign tasks.
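The core mechanism can be sketched numerically. Assuming (hypothetically) that the jailbreak direction is estimated as the normalized difference between mean hidden states of jailbreak and refusal samples, JRS-Rem would subtract the component of the image-induced representation shift that lies along that direction; the exact estimator and layer choice in the paper may differ.

```python
import numpy as np

def jailbreak_direction(h_jailbreak: np.ndarray, h_refusal: np.ndarray) -> np.ndarray:
    """Hypothetical estimator: unit vector from refusal-state mean to
    jailbreak-state mean over collected hidden representations."""
    d = h_jailbreak.mean(axis=0) - h_refusal.mean(axis=0)
    return d / np.linalg.norm(d)

def remove_jrs(h_text: np.ndarray, h_image: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Remove the jailbreak-related component of the image-induced shift.

    h_text:  hidden state for the text-only prompt
    h_image: hidden state for the same prompt with the image attached
    d:       unit jailbreak direction
    """
    shift = h_image - h_text          # image-induced representation shift
    jrs = (shift @ d) * d             # component of the shift along d
    return h_image - jrs              # defended representation
```

After this projection, the residual shift is orthogonal to the jailbreak direction, so the image can still contribute task-relevant information while its push toward the jailbreak state is cancelled.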
Key Contributions
- Identifies that VLM jailbreaks create a distinct representation state separate from benign and refusal states, refuting the 'safety perception failure' hypothesis
- Defines 'jailbreak-related shift' (JRS) as the component of image-induced representation shift that steers models toward jailbreak states
- Proposes JRS-Rem defense that removes jailbreak-related shifts at inference time while preserving benign task performance
🛡️ Threat Analysis
The paper analyzes adversarial visual inputs (images) that manipulate VLM behavior to bypass safety alignment. The attack vector is the visual modality: images induce representation shifts that steer the model toward jailbreak outputs, i.e., input manipulation at inference time.