defense 2025

Learning to Detect Unseen Jailbreak Attacks in Large Vision-Language Models

Shuang Liang 1, Zhihao Xu 1, Jiaqi Weng 2, Jialing Tao 2, Hui Xue 2, Xiting Wang


Published on arXiv: 2508.09201

Input Manipulation Attack (OWASP ML Top 10, ML01)

Prompt Injection (OWASP LLM Top 10, LLM01)

Key Finding

LoD achieves state-of-the-art AUROC on diverse unseen jailbreak attacks across multiple LVLMs without requiring any attack data or hand-crafted heuristics during training.

Novel Technique

Learning to Detect (LoD) with MSCAV + SPAE


Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks. Detection methods are essential for mitigating these risks, yet existing approaches face two major challenges: generalization and accuracy. Learning-based methods trained on specific attacks fail to generalize to unseen attacks, while learning-free methods based on hand-crafted heuristics suffer from limited accuracy and reduced efficiency. To address these limitations, we propose Learning to Detect (LoD), a learnable framework that eliminates the need for any attack data or hand-crafted heuristics. LoD first extracts layer-wise safety representations directly from the model's internal activations using Multi-modal Safety Concept Activation Vector (MSCAV) classifiers, and then converts these high-dimensional representations into a one-dimensional anomaly score for detection via a Safety Pattern Auto-Encoder (SPAE). Extensive experiments demonstrate that LoD consistently achieves state-of-the-art detection performance (AUROC) across diverse unseen jailbreak attacks on multiple LVLMs, while also significantly improving efficiency. Code is available at https://anonymous.4open.science/r/Learning-to-Detect-51CB.


Key Contributions

  • Multi-modal Safety Concept Activation Vectors (MSCAV) classifiers that extract layer-wise safety representations from LVLM internal activations without requiring attack data
  • Safety Pattern Auto-Encoder (SPAE) that converts high-dimensional safety representations into a 1D anomaly score using unsupervised reconstruction for attack detection
  • LoD framework achieves state-of-the-art AUROC against unseen jailbreak attacks on multiple LVLMs with improved computational efficiency over learning-free baselines
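
The two components above can be sketched end to end. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the per-layer safety probes (standing in for MSCAV classifiers) use random placeholder weights rather than trained ones, synthetic activations stand in for real LVLM hidden states, and a rank-k PCA projection (the optimal linear auto-encoder) stands in for the learned SPAE. The key idea survives the simplification: the auto-encoder is fit on safe samples only, so inputs whose layer-wise safety pattern deviates from the safe distribution reconstruct poorly and receive a high anomaly score.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# --- MSCAV (sketch): one linear safety probe per layer ---
# In the paper these classifiers are trained on safe data; here the
# weights are random placeholders purely for illustration.
n_layers, d = 8, 32
probes = [(rng.normal(size=d), rng.normal()) for _ in range(n_layers)]

def safety_representation(activations):
    """Map per-layer activations (n_layers, d) -> layer-wise safety scores."""
    return np.array([sigmoid(w @ h + b) for (w, b), h in zip(probes, activations)])

# --- SPAE (sketch): auto-encoder fit on safe samples only ---
# A rank-k PCA projection is the optimal linear auto-encoder, so it
# stands in here for the learned encoder/decoder pair.
def fit_spae(safe_reps, k=2):
    mu = safe_reps.mean(axis=0)
    _, _, vt = np.linalg.svd(safe_reps - mu, full_matrices=False)
    return mu, vt[:k]  # mean + top-k principal directions

def anomaly_score(rep, mu, basis):
    z = (rep - mu) @ basis.T            # encode
    recon = mu + z @ basis              # decode
    return float(np.sum((rep - recon) ** 2))  # reconstruction error

# Synthetic "safe" activations; their safety patterns cluster tightly.
acts_safe = [0.1 * rng.normal(size=(n_layers, d)) for _ in range(200)]
safe_reps = np.stack([safety_representation(a) for a in acts_safe])
mu, basis = fit_spae(safe_reps)

# A jailbreak shifts the safety pattern out of distribution; its
# reconstruction error exceeds the typical safe-sample score.
attack_rep = np.ones(n_layers)
safe_scores = [anomaly_score(r, mu, basis) for r in safe_reps]
print(anomaly_score(attack_rep, mu, basis), float(np.mean(safe_scores)))
```

Detection then reduces to thresholding the 1D anomaly score, which is why the method needs no attack examples at training time: only the safe-sample reconstruction statistics are learned.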

🛡️ Threat Analysis

Input Manipulation Attack

The paper defends against adversarial visual perturbations on images that bypass LVLM safety filters, i.e., adversarial input manipulation at inference time. The dual ML01+LLM01 tagging rule applies: the adversarial visual input triggers ML01 (image perturbation), while the jailbreak prompt component triggers LLM01.


Details

Domains: vision, nlp, multimodal
Model Types: vlm, llm, transformer
Threat Tags: inference_time, black_box
Datasets: AdvBench
Applications: large vision-language model safety, jailbreak detection, multimodal AI safety