Defense · 2025

CrossGuard: Safeguarding MLLMs against Joint-Modal Implicit Malicious Attacks

Xu Zhang 1, Hao Li 2, Zhichao Lu 1

0 citations · 45 references · arXiv


Published on arXiv: 2510.17687

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

CrossGuard maintains consistently low attack success rates on both explicit and implicit multimodal jailbreak benchmarks, outperforming GPT-4o, LlamaGuard-Vision, and HiddenDetect across all evaluated settings

CrossGuard / ImpForge

Novel technique introduced


Multimodal Large Language Models (MLLMs) achieve strong reasoning and perception capabilities but are increasingly vulnerable to jailbreak attacks. While existing work focuses on explicit attacks, where malicious content resides in a single modality, recent studies reveal implicit attacks, in which benign text and image inputs jointly express unsafe intent. Such joint-modal threats are difficult to detect and remain underexplored, largely due to the scarcity of high-quality implicit data. We propose ImpForge, an automated red-teaming pipeline that leverages reinforcement learning with tailored reward modules to generate diverse implicit samples across 14 domains. Building on this dataset, we further develop CrossGuard, an intent-aware safeguard providing robust and comprehensive defense against both explicit and implicit threats. Extensive experiments across safe and unsafe benchmarks, implicit and explicit attacks, and multiple out-of-domain settings demonstrate that CrossGuard significantly outperforms existing defenses, including advanced MLLMs and guardrails, achieving stronger security while maintaining high utility. This offers a balanced and practical solution for enhancing MLLM robustness against real-world multimodal threats.


Key Contributions

  • ImpForge: an RL-based automated red-teaming pipeline with safety, semantic, and overlap reward modules that generates diverse high-quality implicit joint-modal malicious samples across 14 domains
  • CrossGuard: a LoRA-fine-tuned LLaVA-based intent-aware safeguard that defends against both implicit (joint-modal) and explicit (text- and vision-based) MLLM jailbreaks
  • Comprehensive evaluation across safe/unsafe benchmarks and out-of-domain settings showing CrossGuard outperforms advanced MLLMs and existing guardrails while maintaining high utility
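The contributions above describe ImpForge's RL reward as a combination of safety, semantic, and overlap modules. The sketch below illustrates how such a composite reward could be wired together; the scoring heuristics, weights, and function names are illustrative assumptions (a real pipeline would score with trained safety and embedding models, not keyword heuristics), not the paper's implementation.

```python
# Hedged sketch of an ImpForge-style composite reward. All three module
# implementations below are placeholder heuristics; only the three-module
# structure (safety, semantic, overlap) comes from the paper.

def safety_reward(text: str, image_caption: str) -> float:
    """Assumed module: rewards samples whose JOINT intent is unsafe while
    each single modality stays benign (placeholder keyword heuristic)."""
    unsafe_jointly = "bypass" in (text + " " + image_caption).lower()
    benign_alone = "attack" not in text.lower()
    return 1.0 if (unsafe_jointly and benign_alone) else 0.0

def semantic_reward(text: str, target_intent: str) -> float:
    """Assumed module: token-overlap proxy for fidelity to the target
    malicious intent (a real system would use an embedding model)."""
    t, g = set(text.lower().split()), set(target_intent.lower().split())
    return len(t & g) / max(len(g), 1)

def overlap_reward(text: str, image_caption: str) -> float:
    """Assumed module: penalizes redundant text/image content, so the
    unsafe intent is only recoverable from both modalities together."""
    t, c = set(text.lower().split()), set(image_caption.lower().split())
    return 1.0 - len(t & c) / max(len(t | c), 1)

def composite_reward(text: str, caption: str, target: str,
                     w: tuple = (0.5, 0.3, 0.2)) -> float:
    """Weighted sum fed to the RL policy update (weights are assumptions)."""
    return (w[0] * safety_reward(text, caption)
            + w[1] * semantic_reward(text, target)
            + w[2] * overlap_reward(text, caption))
```

The design point the sketch captures: each module scores a different failure mode (unsafe joint intent, drift from the target domain, cross-modal redundancy), and the policy is optimized against their weighted sum rather than any single signal.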

🛡️ Threat Analysis

Input Manipulation Attack

CrossGuard explicitly defends against perturbation-based visual jailbreaks, in which adversarial visual inputs are fed to VLMs. This adversarial-visual-input scenario is what warrants co-tagging ML01 (Input Manipulation Attack) alongside LLM01 (Prompt Injection).
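The core distinction in this threat model is per-modality filtering versus intent-aware joint-modal reasoning. The sketch below contrasts the two; the labels, keyword rules, and `Verdict` type are illustrative assumptions (CrossGuard itself is a LoRA-fine-tuned LLaVA classifier over text+image pairs, not a keyword rule).

```python
# Hedged sketch contrasting a per-modality filter (the failure mode implicit
# attacks exploit) with a joint-modal, intent-aware check in the spirit of
# CrossGuard. All rules here are hypothetical placeholders.

from dataclasses import dataclass

@dataclass
class Verdict:
    unsafe: bool
    reason: str

def classify_single(content: str) -> bool:
    """Placeholder per-modality filter: flags only explicitly unsafe
    keywords, so benign-looking halves of an implicit attack pass."""
    return any(k in content.lower() for k in ("exploit", "weapon"))

def classify_joint(text: str, image_caption: str) -> Verdict:
    """Intent-aware check: reason over BOTH modalities together instead
    of filtering each one independently."""
    if classify_single(text) or classify_single(image_caption):
        return Verdict(True, "explicit unsafe content in one modality")
    # Assumed joint rule: a benign question plus a benign image can still
    # compose an unsafe request; a real guardrail would score the fused
    # input with a fine-tuned MLLM rather than a keyword pair.
    if "how do i make" in text.lower() and "chemical" in image_caption.lower():
        return Verdict(True, "implicit joint-modal unsafe intent")
    return Verdict(False, "benign")
```

Note how the text "How do I make this at home?" and an image of chemical precursors each pass the per-modality filter, yet their combination is flagged, which is exactly the joint-modal gap described above.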


Details

Domains
multimodal · nlp · vision
Model Types
vlm · llm · multimodal
Threat Tags
inference_time · black_box · training_time
Datasets
SIUO · VLGuard · JailBreakV · MMBench · MMSafetyBench · FigStep
Applications
multimodal large language models · vlm safety guardrails · jailbreak defense