
Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs

Jianhao Chen 1,2, Haoyang Chen 1,2, Hanjie Zhao 3,2, Haozhe Liang 1,2, Tieyun Qian 4,2


Published on arXiv (2604.12616)

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves 71.48% attack success rate against Qwen3-VL-Plus on full unmodified COCO images, scaling to 90% under extended attack budgets

MemJack

Novel technique introduced


The rapid evolution of Vision-Language Models (VLMs) has catalyzed unprecedented capabilities in artificial intelligence; however, this continuous modal expansion has inadvertently exposed a vastly broadened and unconstrained adversarial attack surface. Current multimodal jailbreak strategies focus primarily on surface-level pixel perturbations, typographic attacks, or overtly harmful images, and fail to engage with the complex semantic structures intrinsic to visual data. This leaves the vast semantic attack surface of original, natural images largely unscrutinized. To expose these deep-seated semantic vulnerabilities, we introduce MemJack, a MEMory-augmented multi-agent JAilbreak attaCK framework that explicitly leverages visual semantics to orchestrate automated jailbreak attacks. MemJack employs coordinated multi-agent cooperation to dynamically map visual entities to malicious intents, generates adversarial prompts via multi-angle visual-semantic camouflage, and uses an Iterative Nullspace Projection (INLP) geometric filter to bypass premature latent-space refusals. By accumulating and transferring successful strategies through a persistent Multimodal Experience Memory, MemJack sustains coherent, extended multi-turn jailbreak interactions across different images, thereby improving the attack success rate (ASR) on new images. Extensive empirical evaluation on full, unmodified COCO val2017 images shows that MemJack achieves a 71.48% ASR against Qwen3-VL-Plus, scaling to 90% under extended attack budgets. Finally, to catalyze future defensive alignment research, we will release MemJack-Bench, a dataset of over 113,000 interactive multimodal jailbreak attack trajectories, establishing a foundation for developing inherently robust VLMs.
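The paper describes its Multimodal Experience Memory only at a high level: successful strategies are accumulated and transferred across images. The sketch below is a minimal, hypothetical illustration of that idea (all class and method names are my own, not the authors'): attack outcomes are recorded per image, and previously successful strategies are retrieved for a new image by overlap between its detected visual entities and those of past images.

```python
from dataclasses import dataclass


@dataclass
class Experience:
    entities: frozenset  # visual entities detected in the image
    strategy: str        # camouflage strategy that was attempted
    success: bool        # whether the attempt succeeded


class ExperienceMemory:
    """Toy cross-image memory: retrieve past successful strategies
    whose visual entities overlap those of a new image."""

    def __init__(self):
        self._log = []

    def record(self, entities, strategy, success):
        self._log.append(Experience(frozenset(entities), strategy, success))

    def retrieve(self, entities, k=3):
        entities = frozenset(entities)
        # Keep only successful experiences sharing at least one entity.
        hits = [e for e in self._log if e.success and e.entities & entities]
        # Rank by size of the entity overlap, largest first.
        hits.sort(key=lambda e: len(e.entities & entities), reverse=True)
        return [e.strategy for e in hits[:k]]
```

In this toy form, retrieval is a simple set-overlap ranking; the actual system presumably uses learned multimodal embeddings rather than discrete entity sets.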


Key Contributions

  • MemJack framework using multi-agent coordination and visual-semantic mapping to automate VLM jailbreaks
  • Iterative Nullspace Projection (INLP) geometric filter to bypass latent space refusals
  • Multimodal Experience Memory for transferring successful attack strategies across images and multi-turn interactions
  • MemJack-Bench dataset with 113,000+ multimodal jailbreak trajectories for future defense research
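The contributions above reference an Iterative Nullspace Projection (INLP) geometric filter. As a rough illustration of the underlying geometric operation only (not the authors' implementation, whose details the summary does not give), the sketch below iteratively removes the components of an embedding along a set of assumed "refusal" direction vectors, projecting it onto their joint nullspace:

```python
def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def inlp_filter(embedding, refusal_dirs, n_rounds=1):
    """Project `embedding` onto the nullspace of each refusal direction.

    `refusal_dirs` is a list of vectors assumed to span the latent
    directions associated with refusal behavior. Each round removes the
    component of the embedding along each (normalized) direction.
    """
    v = list(embedding)
    for _ in range(n_rounds):
        for w in refusal_dirs:
            norm = _dot(w, w) ** 0.5
            u = [x / norm for x in w]          # unit refusal direction
            c = _dot(v, u)                     # component along it
            v = [x - c * ui for x, ui in zip(v, u)]  # remove it
    return v
```

After filtering, the result is orthogonal to every supplied direction; in INLP-style pipelines the directions themselves are typically obtained by repeatedly fitting a linear probe and taking its weight vector, which is omitted here.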

🛡️ Threat Analysis

Input Manipulation Attack

Uses adversarial visual inputs (semantic manipulation of original images) to cause VLMs to produce harmful outputs at inference time.


Details

Domains
multimodal, vision, nlp
Model Types
vlm, multimodal, transformer
Threat Tags
black_box, inference_time, targeted, digital
Datasets
COCO val2017, MemJack-Bench
Applications
vision-language models, multimodal AI systems