JaiLIP: Jailbreaking Vision-Language Models via Loss Guided Image Perturbation
Published on arXiv
2509.21401
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
JaiLIP achieves higher toxicity scores than PGD-based baselines on MiniGPT-4 and BLIP-2 while producing visually imperceptible adversarial images.
JaiLIP
Novel technique introduced
Vision-Language Models (VLMs) have remarkable abilities in multimodal reasoning tasks. However, concerns about potential misuse and the safety alignment of VLMs have grown significantly as attack vectors multiply. Among these, recent studies have demonstrated that image-based perturbations are particularly effective at eliciting harmful outputs. Many existing techniques for jailbreaking VLMs, however, suffer from unstable performance and visible perturbations. In this study, we propose Jailbreaking with Loss-guided Image Perturbation (JaiLIP), a jailbreaking attack in the image space that minimizes a joint objective combining the mean squared error (MSE) loss between the clean and adversarial images with the model's harmful-output loss. We evaluate our proposed method on VLMs using standard toxicity metrics from Perspective API and Detoxify. Experimental results demonstrate that our method generates highly effective and imperceptible adversarial images, outperforming existing methods at inducing toxic outputs. Moreover, we evaluate our method in the transportation domain to demonstrate the attack's practicality beyond generic toxic text generation. Our findings underscore the practical risks posed by image-based jailbreak attacks and the need for efficient defense mechanisms for VLMs.
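Based on the description above, the joint objective can be sketched as follows (the weighting term $\lambda$ and the exact form of the harmful-output loss are illustrative assumptions, not details taken from the paper):

```latex
\min_{x_{\mathrm{adv}}} \;
\lambda \,\mathrm{MSE}\!\left(x_{\mathrm{adv}},\, x\right)
\;+\;
\mathcal{L}_{\mathrm{harm}}\!\left(f(x_{\mathrm{adv}}),\, y_{\mathrm{harm}}\right)
```

where $x$ is the clean image, $f$ is the VLM, $y_{\mathrm{harm}}$ is a target harmful response, and $\lambda$ trades off imperceptibility (small perturbation) against attack strength.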
Key Contributions
- Proposes JaiLIP, a joint objective combining MSE perturbation loss and harmful-output loss to generate imperceptible adversarial images that jailbreak VLMs
- Demonstrates higher attack success rates and toxicity scores than PGD-based baselines on MiniGPT-4 and BLIP-2
- Validates attack generalizability in the transportation domain beyond general toxicity benchmarks
🛡️ Threat Analysis
JaiLIP crafts adversarial visual inputs using gradient-based optimization (minimizing a joint MSE + harmful-output loss) to cause VLMs to produce harmful outputs at inference time — a classic adversarial input manipulation attack in the image space.
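The gradient-based optimization described above can be sketched with a toy stand-in for the model. Everything here is an illustrative assumption: `jailip_attack`, the quadratic "harmful-output" loss, the step size, and `lam` are not from the paper, and the real attack backpropagates the harmful-output loss through the VLM rather than using a closed-form gradient.

```python
import numpy as np

def jailip_attack(x, harm_grad, lam=0.1, step=0.05, iters=200):
    """Toy sketch of a JaiLIP-style joint-loss descent.

    Minimizes lam * MSE(x_adv, x) + L_harm(x_adv) by gradient descent,
    where harm_grad returns the gradient of the harmful-output term.
    """
    x_adv = x.copy()
    for _ in range(iters):
        # gradient of the joint objective: perturbation term + harm term
        g = lam * 2.0 * (x_adv - x) + harm_grad(x_adv)
        x_adv -= step * g
        x_adv = np.clip(x_adv, 0.0, 1.0)  # keep a valid image
    return x_adv

# Stand-in "harmful-output loss": a quadratic pulling the image toward
# a target t (a real attack would use the VLM's loss on harmful text).
t = np.full((8, 8), 0.6)
harm_grad = lambda xa: 2.0 * (xa - t)

x = np.random.default_rng(0).random((8, 8))  # toy "clean image"
x_adv = jailip_attack(x, harm_grad)
```

With this quadratic stand-in the iterate converges toward a weighted average of the clean image and the target, which mirrors the intended trade-off: the MSE term anchors `x_adv` near `x` while the harm term pulls it toward outputs the attacker wants.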