JaiLIP: Jailbreaking Vision-Language Models via Loss Guided Image Perturbation
Published on arXiv
2509.21401
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
JaiLIP achieves higher toxicity scores than PGD-based baselines on MiniGPT-4 and BLIP-2 while producing visually imperceptible adversarial images.
JaiLIP
Novel technique introduced
Vision-Language Models (VLMs) have remarkable abilities in multimodal reasoning tasks. However, concerns about potential misuse and the safety alignment of VLMs have grown significantly as attack vectors multiply. Among these, recent studies have demonstrated that image-based perturbations are particularly effective at eliciting harmful outputs. Many existing techniques for jailbreaking VLMs, however, suffer from unstable performance and visible perturbations. In this study, we propose Jailbreaking with Loss-guided Image Perturbation (JaiLIP), a jailbreaking attack in the image space that minimizes a joint objective combining the mean squared error (MSE) loss between the clean and adversarial images with the model's harmful-output loss. We evaluate our proposed method on VLMs using standard toxicity metrics from Perspective API and Detoxify. Experimental results demonstrate that our method generates highly effective and imperceptible adversarial images, outperforming existing methods at inducing toxic outputs. Moreover, we evaluate our method in the transportation domain to demonstrate the attack's practicality beyond generic toxic text generation. Our findings underscore the practical risks posed by image-based jailbreak attacks and the need for efficient defense mechanisms for VLMs.
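Based on the description above, the joint objective can be sketched as follows (the weighting term $\lambda$ and the exact form of the harmful-output loss are illustrative assumptions, not details taken from the paper):

```latex
\min_{x_{\mathrm{adv}}} \;
\lambda \,\mathrm{MSE}\!\left(x_{\mathrm{adv}},\, x\right)
\;+\;
\mathcal{L}_{\mathrm{harm}}\!\left(f(x_{\mathrm{adv}}),\, y_{\mathrm{harm}}\right)
```

where $x$ is the clean image, $f$ is the VLM, $y_{\mathrm{harm}}$ is a target harmful response, and $\lambda$ trades off imperceptibility (small perturbation) against attack strength.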
Key Contributions
- Proposes JaiLIP, a joint objective combining MSE perturbation loss and harmful-output loss to generate imperceptible adversarial images that jailbreak VLMs
- Demonstrates higher attack success rates and toxicity scores than PGD-based baselines on MiniGPT-4 and BLIP-2
- Validates attack generalizability in the transportation domain beyond general toxicity benchmarks
🛡️ Threat Analysis
JaiLIP crafts adversarial visual inputs using gradient-based optimization (minimizing a joint MSE + harmful-output loss) to cause VLMs to produce harmful outputs at inference time — a classic adversarial input manipulation attack in the image space.
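The gradient-based optimization described above can be sketched with a toy stand-in for the model. Everything here is an illustrative assumption: `jailip_attack`, the quadratic "harmful-output" loss, the step size, and `lam` are not from the paper, and the real attack backpropagates the harmful-output loss through the VLM rather than using a closed-form gradient.

```python
import numpy as np

def jailip_attack(x, harm_grad, lam=0.1, step=0.05, iters=200):
    """Toy sketch of a JaiLIP-style joint-loss descent.

    Minimizes lam * MSE(x_adv, x) + L_harm(x_adv) by gradient descent,
    where harm_grad returns the gradient of the harmful-output term.
    """
    x_adv = x.copy()
    for _ in range(iters):
        # gradient of the joint objective: perturbation term + harm term
        g = lam * 2.0 * (x_adv - x) + harm_grad(x_adv)
        x_adv -= step * g
        x_adv = np.clip(x_adv, 0.0, 1.0)  # keep a valid image
    return x_adv

# Stand-in "harmful-output loss": a quadratic pulling the image toward
# a target t (a real attack would use the VLM's loss on harmful text).
t = np.full((8, 8), 0.6)
harm_grad = lambda xa: 2.0 * (xa - t)

x = np.random.default_rng(0).random((8, 8))  # toy "clean image"
x_adv = jailip_attack(x, harm_grad)
```

With this quadratic stand-in the iterate converges toward a weighted average of the clean image and the target, which mirrors the intended trade-off: the MSE term anchors `x_adv` near `x` while the harm term pulls it toward outputs the attacker wants.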