Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation

Zhiheng Li 1,2, Zongyang Ma 1,2, Yuntong Pan 3, Ziqi Zhang 1,2, Xiaolei Lv 4, Bo Li 4, Jun Gao 4, Jianing Zhang 5, Chunfeng Yuan 1,2, Bing Li 1,2, Weiming Hu 1,2,5,6


Published on arXiv

2604.06950

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves >90% attack success rate against state-of-the-art MLLMs including GPT-5 and Gemini 2.5 Pro on SmuggleBench

Adversarial Smuggling Attacks (ASA)

Novel technique introduced


Multimodal Large Language Models (MLLMs) are increasingly deployed as automated content moderators. Within this landscape, we uncover a critical threat: Adversarial Smuggling Attacks. Unlike adversarial perturbations (which induce misclassification) and adversarial jailbreaks (which elicit harmful output), adversarial smuggling exploits the human-AI capability gap: it encodes harmful content into visual formats that remain human-readable but AI-unreadable, thereby evading automated detection and enabling the dissemination of harmful content. We classify smuggling attacks into two pathways: (1) Perceptual Blindness, which disrupts text recognition; and (2) Reasoning Blockade, which inhibits semantic understanding even when text recognition succeeds. To evaluate this threat, we construct SmuggleBench, the first comprehensive benchmark, comprising 1,700 adversarial smuggling attack instances. Evaluations on SmuggleBench reveal that both proprietary (e.g., GPT-5) and open-source (e.g., Qwen3-VL) state-of-the-art models are vulnerable, with Attack Success Rates (ASR) exceeding 90%. Analyzing the vulnerability through the lenses of perception and reasoning, we identify three root causes: the limited capabilities of vision encoders, the robustness gap in OCR, and the scarcity of domain-specific adversarial examples. Finally, we conduct a preliminary exploration of mitigations, investigating test-time scaling (via CoT) and adversarial training (via SFT). Our code is publicly available at https://github.com/zhihengli-casia/smugglebench.


Key Contributions

  • Identifies and formalizes Adversarial Smuggling Attacks (ASA) as a distinct threat class exploiting human-AI capability gaps in content moderation
  • Constructs SmuggleBench, the first comprehensive benchmark with 1,700 adversarial smuggling instances across 9 distinct techniques
  • Demonstrates >90% attack success rates against SOTA MLLMs (GPT-5, Gemini 2.5 Pro, Qwen3-VL) and identifies three root causes of vulnerability
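The >90% figure above is an Attack Success Rate (ASR): the fraction of known-harmful adversarial instances that the moderator fails to flag. A minimal sketch of that metric (the record layout and pathway names here are illustrative assumptions, not the paper's code):

```python
# Hypothetical sketch: per-pathway Attack Success Rate (ASR).
# An attack "succeeds" when a known-harmful instance is NOT flagged.
from collections import defaultdict

def attack_success_rate(records):
    """records: iterable of (pathway, moderator_flagged_harmful) pairs."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for pathway, flagged in records:
        totals[pathway] += 1
        if not flagged:  # harmful content slipped past the moderator
            successes[pathway] += 1
    return {p: successes[p] / totals[p] for p in totals}

verdicts = [
    ("perceptual_blindness", False),  # evaded
    ("perceptual_blindness", True),   # caught
    ("reasoning_blockade", False),    # evaded
    ("reasoning_blockade", False),    # evaded
]
print(attack_success_rate(verdicts))
# {'perceptual_blindness': 0.5, 'reasoning_blockade': 1.0}
```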

🛡️ Threat Analysis

Prompt Injection

The paper targets MLLM-based content moderation systems, aiming to bypass safety guardrails and enable dissemination of harmful content (hate speech, violence, extremism). The Reasoning Blockade pathway specifically manipulates semantic understanding to evade threat detection, which aligns with LLM safety/jailbreaking concerns. This is a multimodal attack targeting VLM safety mechanisms.

Input Manipulation Attack

Adversarial smuggling attacks craft visual inputs that cause MLLM content moderators to misclassify harmful content as benign at inference time. The attack operates via two pathways: (1) Perceptual Blindness disrupts text recognition (visual adversarial manipulation), and (2) Reasoning Blockade causes semantic misinterpretation. This is an inference-time input manipulation attack causing misclassification, which is the core of ML01.
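As a toy illustration of the Perceptual Blindness idea (a generic sketch, not one of the paper's nine techniques): text rendered as a visual pattern stays readable to a human but vanishes from the character stream that a text-level filter scans.

```python
# Toy sketch (assumption: not the paper's method): render a word as ASCII art.
# A human can still read the shape, but naive string matching on the rendered
# form no longer finds the original token.
FONT = {  # tiny 3x3 bitmap glyphs, for demonstration only
    "H": ["#.#", "###", "#.#"],
    "I": ["###", ".#.", "###"],
}

def ascii_art(word):
    """Render `word` (letters present in FONT) as 3-row ASCII art."""
    return "\n".join("  ".join(FONT[c][row] for c in word) for row in range(3))

rendered = ascii_art("HI")
print(rendered)
assert "HI" not in rendered  # the literal token is gone from the text stream
```

Real smuggling attacks target the vision pipeline rather than string matching, but the capability gap is the same: the harmful message survives for human readers while the automated reader loses it.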


Details

Domains
multimodal, vision, nlp
Model Types
vlm, multimodal, transformer
Threat Tags
black_box, inference_time, targeted, digital
Datasets
SmuggleBench
Applications
content moderation, social media filtering, automated censorship