Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation

Zhiheng Li 1,2, Zongyang Ma 1,2, Yuntong Pan 3, Ziqi Zhang 1,2, Xiaolei Lv 4, Bo Li 4, Jun Gao 4, Jianing Zhang 5, Chunfeng Yuan 1,2, Bing Li 1,2, Weiming Hu 1,2,5,6


Published on arXiv

2604.06950

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves >90% attack success rate against state-of-the-art MLLMs including GPT-5 and Gemini 2.5 Pro on SmuggleBench

Adversarial Smuggling Attacks (ASA)

Novel technique introduced


Multimodal Large Language Models (MLLMs) are increasingly deployed as automated content moderators. Within this landscape, we uncover a critical threat: Adversarial Smuggling Attacks. Unlike adversarial perturbations (which induce misclassification) and adversarial jailbreaks (which elicit harmful output), adversarial smuggling exploits the human-AI capability gap: it encodes harmful content into visual formats that remain human-readable but AI-unreadable, thereby evading automated detection and enabling the dissemination of harmful content. We classify smuggling attacks into two pathways: (1) Perceptual Blindness, which disrupts text recognition; and (2) Reasoning Blockade, which inhibits semantic understanding even when text recognition succeeds. To evaluate this threat, we construct SmuggleBench, the first comprehensive benchmark, comprising 1,700 adversarial smuggling attack instances. Evaluations on SmuggleBench reveal that both proprietary (e.g., GPT-5) and open-source (e.g., Qwen3-VL) state-of-the-art models are vulnerable, with Attack Success Rates (ASR) exceeding 90%. Analyzing the vulnerability through the lenses of perception and reasoning, we identify three root causes: the limited capabilities of vision encoders, the robustness gap in OCR, and the scarcity of domain-specific adversarial examples. Finally, we conduct a preliminary exploration of mitigations, investigating test-time scaling (via CoT) and adversarial training (via SFT). Our code is publicly available at https://github.com/zhihengli-casia/smugglebench.


Key Contributions

  • Identifies and formalizes Adversarial Smuggling Attacks (ASA) as a distinct threat class exploiting human-AI capability gaps in content moderation
  • Constructs SmuggleBench, the first comprehensive benchmark with 1,700 adversarial smuggling instances across 9 distinct techniques
  • Demonstrates >90% attack success rates against SOTA MLLMs (GPT-5, Gemini 2.5 Pro, Qwen3-VL) and identifies three root causes of vulnerability
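The >90% figure above is an Attack Success Rate (ASR): the fraction of known-harmful adversarial instances that the moderator fails to flag. A minimal sketch of that metric (the record layout and pathway names here are illustrative assumptions, not the paper's code):

```python
# Hypothetical sketch: per-pathway Attack Success Rate (ASR).
# An attack "succeeds" when a known-harmful instance is NOT flagged.
from collections import defaultdict

def attack_success_rate(records):
    """records: iterable of (pathway, moderator_flagged_harmful) pairs."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for pathway, flagged in records:
        totals[pathway] += 1
        if not flagged:  # harmful content slipped past the moderator
            successes[pathway] += 1
    return {p: successes[p] / totals[p] for p in totals}

verdicts = [
    ("perceptual_blindness", False),  # evaded
    ("perceptual_blindness", True),   # caught
    ("reasoning_blockade", False),    # evaded
    ("reasoning_blockade", False),    # evaded
]
print(attack_success_rate(verdicts))
# {'perceptual_blindness': 0.5, 'reasoning_blockade': 1.0}
```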

🛡️ Threat Analysis

Prompt Injection

The paper targets MLLM-based content moderation systems, aiming to bypass safety guardrails and enable dissemination of harmful content (hate speech, violence, extremism). The Reasoning Blockade pathway specifically manipulates semantic understanding to evade threat detection, which aligns with LLM safety/jailbreaking concerns. This is a multimodal attack targeting VLM safety mechanisms.

Input Manipulation Attack

Adversarial smuggling attacks craft visual inputs that cause MLLM content moderators to misclassify harmful content as benign at inference time. The attack operates via two pathways: (1) Perceptual Blindness disrupts text recognition (visual adversarial manipulation), and (2) Reasoning Blockade causes semantic misinterpretation. This is an inference-time input manipulation attack causing misclassification, which is the core of ML01.
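As a toy illustration of the Perceptual Blindness idea (a generic sketch, not one of the paper's nine techniques): text rendered as a visual pattern stays readable to a human but vanishes from the character stream that a text-level filter scans.

```python
# Toy sketch (assumption: not the paper's method): render a word as ASCII art.
# A human can still read the shape, but naive string matching on the rendered
# form no longer finds the original token.
FONT = {  # tiny 3x3 bitmap glyphs, for demonstration only
    "H": ["#.#", "###", "#.#"],
    "I": ["###", ".#.", "###"],
}

def ascii_art(word):
    """Render `word` (letters present in FONT) as 3-row ASCII art."""
    return "\n".join("  ".join(FONT[c][row] for c in word) for row in range(3))

rendered = ascii_art("HI")
print(rendered)
assert "HI" not in rendered  # the literal token is gone from the text stream
```

Real smuggling attacks target the vision pipeline rather than string matching, but the capability gap is the same: the harmful message survives for human readers while the automated reader loses it.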


Details

Domains
multimodal, vision, nlp
Model Types
vlm, multimodal, transformer
Threat Tags
black_box, inference_time, targeted, digital
Datasets
SmuggleBench
Applications
content moderation, social media filtering, automated censorship