Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models
Shuang Liang, Zhihao Xu, Jialing Tao, Hui Xue, Xiting Wang
Published on arXiv (arXiv:2510.15430)
Prompt Injection
OWASP LLM Top 10: LLM01
Key Finding
Achieves consistently higher detection AUROC than existing methods on diverse unknown jailbreak attacks while improving computational efficiency
LoD (Learning to Detect)
Novel technique introduced
Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. To address this, existing detection methods either learn attack-specific parameters, which hinders generalization to unseen attacks, or rely on heuristically sound principles, which limit accuracy and efficiency. To overcome these limitations, we propose Learning to Detect (LoD), a general framework that accurately detects unknown jailbreak attacks by shifting the focus from attack-specific learning to task-specific learning. This framework includes a Multi-modal Safety Concept Activation Vector module for safety-oriented representation learning and a Safety Pattern Auto-Encoder module for unsupervised attack classification. Extensive experiments show that our method achieves consistently higher detection AUROC on diverse unknown attacks while improving efficiency. The code is available at https://anonymous.4open.science/r/Learning-to-Detect-51CB.
Key Contributions
- Learning to Detect (LoD) framework that shifts from attack-specific to task-specific learning, enabling generalization to unseen jailbreak attacks
- Multi-modal Safety Concept Activation Vector (MSCAV) module for safety-oriented multimodal representation learning
- Safety Pattern Auto-Encoder (SPAE) module for unsupervised attack classification without prior knowledge of attack types
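The two-module pipeline above can be illustrated with a minimal sketch. Note the assumptions: the paper does not specify these exact computations, so the MSCAV step is approximated here as a per-layer safety concept activation vector (the normalized difference of mean hidden activations on unsafe vs. safe prompts), and the SPAE step is approximated by a linear auto-encoder (PCA) fit only on benign features, with reconstruction error as the anomaly score. All function names, dimensions, and the synthetic data are hypothetical stand-ins for real LVLM hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for per-layer LVLM hidden states: (n, n_layers, d).
def fake_hidden_states(n, n_layers=4, d=16, shift=0.0):
    return rng.normal(shift, 1.0, size=(n, n_layers, d))

# --- MSCAV-style step (assumption): one safety concept activation vector per
# layer, taken as the difference of mean activations on unsafe vs. safe inputs.
safe_h = fake_hidden_states(200, shift=0.0)
unsafe_h = fake_hidden_states(200, shift=1.0)
cav = unsafe_h.mean(axis=0) - safe_h.mean(axis=0)       # (n_layers, d)
cav /= np.linalg.norm(cav, axis=1, keepdims=True)

def mscav_features(h):
    # Project each layer's activations onto its safety CAV -> (n, n_layers).
    return np.einsum("nld,ld->nl", h, cav)

# --- SPAE-style step (assumption): a linear auto-encoder fit only on benign
# features; a high reconstruction error flags a (possibly unknown) attack.
X = mscav_features(safe_h)
mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
W = Vt[:2]                                              # keep k = 2 components

def anomaly_score(h):
    z = mscav_features(h) - mu
    recon = (z @ W.T) @ W                               # encode then decode
    return np.linalg.norm(z - recon, axis=1)            # reconstruction error

benign_scores = anomaly_score(fake_hidden_states(50, shift=0.0))
attack_scores = anomaly_score(fake_hidden_states(50, shift=1.0))
print(attack_scores.mean() > benign_scores.mean())
```

Because the auto-encoder is trained only on benign activation patterns, it never needs labels for any particular attack family, which is what lets this style of detector generalize to unseen jailbreaks.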