Understanding and Defending VLM Jailbreaks via Jailbreak-Related Representation Shift
Zhihua Wei 1, Qiang Li 1, Jian Ruan 1, Zhenxin Qin 1, Leilei Wen 1, Dongrui Liu 2, Wen Shen 1
Published on arXiv
2603.17372
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
JRS-Rem provides a strong defense across multiple VLM jailbreak scenarios while maintaining performance on benign tasks; on the HADES dataset, adding a blank image raises LLaVA-1.5-7B's jailbreak success rate by 28.13%
JRS-Rem
Novel technique introduced
Large vision-language models (VLMs) often exhibit weakened safety alignment with the integration of the visual modality. Even when text prompts contain explicit harmful intent, adding an image can substantially increase jailbreak success rates. In this paper, we observe that VLMs can clearly distinguish benign inputs from harmful ones in their representation space. Moreover, even among harmful inputs, jailbreak samples form a distinct internal state that is separable from refusal samples. These observations suggest that jailbreaks do not arise from a failure to recognize harmful intent. Instead, the visual modality shifts representations toward a specific jailbreak state, thereby leading to a failure to trigger refusal. To quantify this transition, we identify a jailbreak direction and define the jailbreak-related shift as the component of the image-induced representation shift along this direction. Our analysis shows that the jailbreak-related shift reliably characterizes jailbreak behavior, providing a unified explanation for diverse jailbreak scenarios. Finally, we propose a defense method that enhances VLM safety by removing the jailbreak-related shift (JRS-Rem) at inference time. Experiments show that JRS-Rem provides strong defense across multiple scenarios while preserving performance on benign tasks.
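The core mechanism can be sketched numerically. Assuming (hypothetically) that the jailbreak direction is estimated as the normalized difference between mean hidden states of jailbreak and refusal samples, JRS-Rem would subtract the component of the image-induced representation shift that lies along that direction; the exact estimator and layer choice in the paper may differ.

```python
import numpy as np

def jailbreak_direction(h_jailbreak: np.ndarray, h_refusal: np.ndarray) -> np.ndarray:
    """Hypothetical estimator: unit vector from refusal-state mean to
    jailbreak-state mean over collected hidden representations."""
    d = h_jailbreak.mean(axis=0) - h_refusal.mean(axis=0)
    return d / np.linalg.norm(d)

def remove_jrs(h_text: np.ndarray, h_image: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Remove the jailbreak-related component of the image-induced shift.

    h_text:  hidden state for the text-only prompt
    h_image: hidden state for the same prompt with the image attached
    d:       unit jailbreak direction
    """
    shift = h_image - h_text          # image-induced representation shift
    jrs = (shift @ d) * d             # component of the shift along d
    return h_image - jrs              # defended representation
```

After this projection, the residual shift is orthogonal to the jailbreak direction, so the image can still contribute task-relevant information while its push toward the jailbreak state is cancelled.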
Key Contributions
- Identifies that VLM jailbreaks create a distinct representation state separate from benign and refusal states, refuting the 'safety perception failure' hypothesis
- Defines 'jailbreak-related shift' (JRS) as the component of image-induced representation shift that steers models toward jailbreak states
- Proposes JRS-Rem defense that removes jailbreak-related shifts at inference time while preserving benign task performance
🛡️ Threat Analysis
The paper analyzes adversarial visual inputs (images) that manipulate VLM behavior to bypass safety alignment. The attack vector is the visual modality: images induce representation shifts that steer the model toward jailbreak outputs, i.e., input manipulation at inference time.