defense 2025

Learning to Detect Unseen Jailbreak Attacks in Large Vision-Language Models

Shuang Liang 1, Zhihao Xu 1, Jiaqi Weng 2, Jialing Tao 2, Hui Xue 2, Xiting Wang


Published on arXiv: 2508.09201

Input Manipulation Attack (OWASP ML Top 10, ML01)

Prompt Injection (OWASP LLM Top 10, LLM01)

Key Finding

LoD achieves state-of-the-art AUROC on diverse unseen jailbreak attacks across multiple LVLMs without requiring any attack data or hand-crafted heuristics during training.

Novel Technique

Learning to Detect (LoD) with MSCAV + SPAE


Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks. Detection methods are essential for mitigating these risks, yet existing approaches face two major challenges: generalization and accuracy. Learning-based methods trained on specific attacks fail to generalize to unseen attacks, while learning-free methods based on hand-crafted heuristics suffer from limited accuracy and reduced efficiency. To address these limitations, we propose Learning to Detect (LoD), a learnable framework that eliminates the need for any attack data or hand-crafted heuristics. LoD first extracts layer-wise safety representations directly from the model's internal activations using Multi-modal Safety Concept Activation Vector (MSCAV) classifiers, and then converts these high-dimensional representations into a one-dimensional anomaly score for detection via a Safety Pattern Auto-Encoder (SPAE). Extensive experiments demonstrate that LoD consistently achieves state-of-the-art detection performance (AUROC) across diverse unseen jailbreak attacks on multiple LVLMs, while also significantly improving efficiency. Code is available at https://anonymous.4open.science/r/Learning-to-Detect-51CB.


Key Contributions

  • Multi-modal Safety Concept Activation Vectors (MSCAV) classifiers that extract layer-wise safety representations from LVLM internal activations without requiring attack data
  • Safety Pattern Auto-Encoder (SPAE) that converts high-dimensional safety representations into a 1D anomaly score using unsupervised reconstruction for attack detection
  • LoD framework achieves state-of-the-art AUROC against unseen jailbreak attacks on multiple LVLMs with improved computational efficiency over learning-free baselines
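
The two components above can be sketched end to end. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the per-layer safety probes (standing in for MSCAV classifiers) use random placeholder weights rather than trained ones, synthetic activations stand in for real LVLM hidden states, and a rank-k PCA projection (the optimal linear auto-encoder) stands in for the learned SPAE. The key idea survives the simplification: the auto-encoder is fit on safe samples only, so inputs whose layer-wise safety pattern deviates from the safe distribution reconstruct poorly and receive a high anomaly score.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# --- MSCAV (sketch): one linear safety probe per layer ---
# In the paper these classifiers are trained on safe data; here the
# weights are random placeholders purely for illustration.
n_layers, d = 8, 32
probes = [(rng.normal(size=d), rng.normal()) for _ in range(n_layers)]

def safety_representation(activations):
    """Map per-layer activations (n_layers, d) -> layer-wise safety scores."""
    return np.array([sigmoid(w @ h + b) for (w, b), h in zip(probes, activations)])

# --- SPAE (sketch): auto-encoder fit on safe samples only ---
# A rank-k PCA projection is the optimal linear auto-encoder, so it
# stands in here for the learned encoder/decoder pair.
def fit_spae(safe_reps, k=2):
    mu = safe_reps.mean(axis=0)
    _, _, vt = np.linalg.svd(safe_reps - mu, full_matrices=False)
    return mu, vt[:k]  # mean + top-k principal directions

def anomaly_score(rep, mu, basis):
    z = (rep - mu) @ basis.T            # encode
    recon = mu + z @ basis              # decode
    return float(np.sum((rep - recon) ** 2))  # reconstruction error

# Synthetic "safe" activations; their safety patterns cluster tightly.
acts_safe = [0.1 * rng.normal(size=(n_layers, d)) for _ in range(200)]
safe_reps = np.stack([safety_representation(a) for a in acts_safe])
mu, basis = fit_spae(safe_reps)

# A jailbreak shifts the safety pattern out of distribution; its
# reconstruction error exceeds the typical safe-sample score.
attack_rep = np.ones(n_layers)
safe_scores = [anomaly_score(r, mu, basis) for r in safe_reps]
print(anomaly_score(attack_rep, mu, basis), float(np.mean(safe_scores)))
```

Detection then reduces to thresholding the 1D anomaly score, which is why the method needs no attack examples at training time: only the safe-sample reconstruction statistics are learned.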

🛡️ Threat Analysis

Input Manipulation Attack

The paper defends against adversarial visual perturbations on images that bypass LVLM safety filters, i.e., adversarial input manipulation at inference time. The dual ML01+LLM01 tagging rule applies: the adversarial visual input triggers ML01 (image perturbation), while the jailbreak prompt component triggers LLM01.


Details

Domains: vision, nlp, multimodal
Model Types: vlm, llm, transformer
Threat Tags: inference_time, black_box
Datasets: AdvBench
Applications: large vision-language model safety, jailbreak detection, multimodal AI safety