CrossGuard: Safeguarding MLLMs against Joint-Modal Implicit Malicious Attacks
Xu Zhang, Hao Li, Zhichao Lu
Published on arXiv (arXiv:2510.17687)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
CrossGuard maintains consistently low attack success rates on both explicit and implicit multimodal jailbreak benchmarks, outperforming GPT-4o, LlamaGuard-Vision, and HiddenDetect across all evaluated settings.
CrossGuard / ImpForge
Novel technique introduced
Multimodal Large Language Models (MLLMs) achieve strong reasoning and perception capabilities but are increasingly vulnerable to jailbreak attacks. While existing work focuses on explicit attacks, where malicious content resides in a single modality, recent studies reveal implicit attacks, in which benign text and image inputs jointly express unsafe intent. Such joint-modal threats are difficult to detect and remain underexplored, largely due to the scarcity of high-quality implicit data. We propose ImpForge, an automated red-teaming pipeline that leverages reinforcement learning with tailored reward modules to generate diverse implicit samples across 14 domains. Building on this dataset, we further develop CrossGuard, an intent-aware safeguard providing robust and comprehensive defense against both explicit and implicit threats. Extensive experiments across safe and unsafe benchmarks, implicit and explicit attacks, and multiple out-of-domain settings demonstrate that CrossGuard significantly outperforms existing defenses, including advanced MLLMs and guardrails, achieving stronger security while maintaining high utility. This offers a balanced and practical solution for enhancing MLLM robustness against real-world multimodal threats.
Key Contributions
- ImpForge: an RL-based automated red-teaming pipeline with safety, semantic, and overlap reward modules that generates diverse high-quality implicit joint-modal malicious samples across 14 domains
- CrossGuard: a LoRA-fine-tuned LLaVA-based intent-aware safeguard that defends against both implicit (joint-modal) and explicit (text- and vision-based) MLLM jailbreaks
- Comprehensive evaluation across safe/unsafe benchmarks and out-of-domain settings showing CrossGuard outperforms advanced MLLMs and existing guardrails while maintaining high utility
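The ImpForge pipeline scores candidate samples with safety, semantic, and overlap reward modules. The sketch below illustrates how such modules might be combined into a scalar RL reward; the function names, inputs, and weights are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of ImpForge-style reward aggregation.
# All names, signatures, and weights are illustrative assumptions.

def safety_reward(unsafe_prob: float) -> float:
    # Reward samples whose individual modalities look benign to a
    # safety classifier, so each input passes single-modal filters.
    return 1.0 - unsafe_prob

def semantic_reward(intent_sim: float) -> float:
    # Reward similarity between the joint text+image meaning and the
    # target unsafe intent (e.g. embedding cosine similarity in [0, 1]).
    return max(0.0, intent_sim)

def overlap_reward(shared_tokens: int, total_tokens: int) -> float:
    # Penalize lexical overlap between modalities so the unsafe intent
    # emerges only when text and image are combined.
    return 1.0 - shared_tokens / max(1, total_tokens)

def total_reward(unsafe_prob: float, intent_sim: float,
                 shared_tokens: int, total_tokens: int,
                 weights=(0.4, 0.4, 0.2)) -> float:
    # Weighted sum of the three reward modules; weights are assumed.
    w_safe, w_sem, w_ovl = weights
    return (w_safe * safety_reward(unsafe_prob)
            + w_sem * semantic_reward(intent_sim)
            + w_ovl * overlap_reward(shared_tokens, total_tokens))
```

An ideal implicit sample (each modality benign, joint intent on target, no lexical overlap) scores near 1.0, while samples that leak the unsafe intent into a single modality score lower.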
🛡️ Threat Analysis
CrossGuard explicitly defends against perturbation-based jailbreak attacks that deliver adversarial visual inputs to VLMs, the scenario that warrants co-tagging ML01 (Input Manipulation Attack) alongside LLM01 (Prompt Injection).