
JaiLIP: Jailbreaking Vision-Language Models via Loss Guided Image Perturbation

Md Jueal Mia , M. Hadi Amini

0 citations · 34 references · arXiv


Published on arXiv · 2509.21401

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

JaiLIP achieves higher toxicity scores than PGD-based baselines on MiniGPT-4 and BLIP-2 while producing visually imperceptible adversarial images.

JaiLIP

Novel technique introduced


Vision-Language Models (VLMs) have remarkable abilities in multimodal reasoning tasks. However, safety alignment and misuse concerns for VLMs have increased significantly due to a range of attack vectors. Among these, recent studies have demonstrated that image-based perturbations are particularly effective at eliciting harmful outputs. Many existing techniques proposed to jailbreak VLMs suffer from unstable performance and visible perturbations. In this study, we propose Jailbreaking with Loss-guided Image Perturbation (JaiLIP), a jailbreaking attack in the image space that minimizes a joint objective combining the mean squared error (MSE) loss between the clean and adversarial images with the model's harmful-output loss. We evaluate the proposed method on VLMs using standard toxicity metrics from Perspective API and Detoxify. Experimental results demonstrate that our method generates highly effective and imperceptible adversarial images, outperforming existing methods at producing toxic outputs. Moreover, we evaluate our method in the transportation domain to demonstrate the attack's practicality beyond general toxic text generation. Our findings emphasize the practical challenges posed by image-based jailbreak attacks and the need for efficient defense mechanisms for VLMs.
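The joint objective described in the abstract can be sketched as follows, where $x$ is the clean image, $x_{\text{adv}}$ the adversarial image, $N$ the number of pixels, and $\lambda$ an assumed weighting hyperparameter (not specified in this summary) that trades off imperceptibility against attack strength:

```latex
\min_{x_{\text{adv}}} \;
\underbrace{\frac{1}{N}\,\lVert x_{\text{adv}} - x \rVert_2^2}_{\text{MSE (imperceptibility)}}
\;+\;
\lambda \,\underbrace{\mathcal{L}_{\text{harm}}(x_{\text{adv}})}_{\text{harmful-output loss}}
```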


Key Contributions

  • Proposes JaiLIP, a joint objective combining MSE perturbation loss and harmful-output loss to generate imperceptible adversarial images that jailbreak VLMs
  • Demonstrates higher attack success rates and toxicity scores than PGD-based baselines on MiniGPT-4 and BLIP-2
  • Validates attack generalizability in the transportation domain beyond general toxicity benchmarks

🛡️ Threat Analysis

Input Manipulation Attack

JaiLIP crafts adversarial visual inputs using gradient-based optimization (minimizing a joint MSE + harmful-output loss) to cause VLMs to produce harmful outputs at inference time — a classic adversarial input manipulation attack in the image space.
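The optimization loop described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `jailip_perturb`, the hyperparameters `lam`, `lr`, `steps`, `eps`, and the `harm_grad_fn` callable (a stand-in for the gradient of the VLM's harmful-output loss, which requires white-box access) are all assumptions made for clarity.

```python
import numpy as np

def jailip_perturb(x_clean, harm_grad_fn, lam=1.0, lr=0.01, steps=100, eps=8 / 255):
    """Sketch of a loss-guided image perturbation in the spirit of JaiLIP.

    Minimizes MSE(x_adv, x_clean) + lam * L_harm(x_adv) by gradient descent,
    with an L-infinity clip on the perturbation for imperceptibility.
    harm_grad_fn(x_adv) must return dL_harm/dx_adv (white-box assumption).
    """
    delta = np.zeros_like(x_clean)
    for _ in range(steps):
        # Gradient of the MSE term (1/N)*||delta||^2 is 2*delta/N.
        g_mse = 2.0 * delta / delta.size
        # Gradient of the harmful-output loss, supplied by the caller.
        g_harm = harm_grad_fn(x_clean + delta)
        delta -= lr * (g_mse + lam * g_harm)
        # Keep the perturbation small (imperceptibility constraint).
        delta = np.clip(delta, -eps, eps)
        # Keep the adversarial image in the valid pixel range [0, 1].
        delta = np.clip(x_clean + delta, 0.0, 1.0) - x_clean
    return x_clean + delta
```

In a real attack the callable would backpropagate the target model's loss on a harmful output through its vision encoder; here any differentiable surrogate can be plugged in to exercise the loop.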


Details

Domains
vision · multimodal · nlp
Model Types
vlm · llm
Threat Tags
white_box · inference_time · targeted · digital
Toxicity Metrics
Perspective API · Detoxify
Applications
vision-language models · multimodal chatbots · transportation AI systems