benchmark 2025

BackdoorVLM: A Benchmark for Backdoor Attacks on Vision-Language Models

Juncheng Li ^1,2, Yige Li ¹, Hanxun Huang ¹, Yunhao Chen ², Xin Wang ³, Yixu Wang ¹, Xingjun Ma ¹, Yu-Gang Jiang ¹

¹ Fudan University

² Singapore Management University

³ The University of Melbourne

0 citations · 44 references · arXiv

Published on arXiv

2511.18921

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

Backdoor attacks using textual triggers achieve over 90% attack success rate with poisoning rates as low as 1% across most VLM tasks, and text triggers consistently dominate image triggers in bimodal settings.

BackdoorVLM

Novel technique introduced

Backdoor attacks undermine the reliability and trustworthiness of machine learning systems by injecting hidden behaviors that can be maliciously activated at inference time. While such threats have been extensively studied in unimodal settings, their impact on multimodal foundation models, particularly vision-language models (VLMs), remains largely underexplored. In this work, we introduce \textbf{BackdoorVLM}, the first comprehensive benchmark for systematically evaluating backdoor attacks on VLMs across a broad range of settings. It adopts a unified perspective that injects and analyzes backdoors across core vision-language tasks, including image captioning and visual question answering. BackdoorVLM organizes multimodal backdoor threats into 5 representative categories: targeted refusal, malicious injection, jailbreak, concept substitution, and perceptual hijack. Each category captures a distinct pathway through which an adversary can manipulate a model's behavior. We evaluate these threats using 12 representative attack methods spanning text, image, and bimodal triggers, tested on 2 open-source VLMs and 3 multimodal datasets. Our analysis reveals that VLMs exhibit strong sensitivity to textual instructions, and in bimodal backdoors the text trigger typically overwhelms the image trigger when forming the backdoor mapping. Notably, backdoors involving the textual modality remain highly potent, with poisoning rates as low as 1\% yielding over 90\% success across most tasks. These findings highlight significant, previously underexplored vulnerabilities in current VLMs. We hope that BackdoorVLM can serve as a useful benchmark for analyzing and mitigating multimodal backdoor threats. Code is available at: https://github.com/bin015/BackdoorVLM .

Key Contributions

First comprehensive benchmark (BackdoorVLM) for systematically evaluating backdoor attacks on VLMs across image captioning and VQA tasks
Taxonomy of 5 multimodal backdoor threat categories and evaluation of 12 representative attack methods spanning text, image, and bimodal triggers on 2 open-source VLMs
Empirical finding that VLMs exhibit strong sensitivity to textual triggers, with text triggers dominating bimodal backdoors and 1% poisoning yielding >90% attack success across most tasks

🛡️ Threat Analysis

Model Poisoning

The paper's entire contribution centers on hidden, trigger-activated backdoor behaviors injected into VLMs — covering 5 categories (targeted refusal, malicious injection, jailbreak, concept substitution, perceptual hijack) across 12 attack methods. This is the canonical ML10 threat: training-time backdoor injection that produces normal behavior until a specific trigger activates the malicious pathway.

Details

Domains

visionnlpmultimodal

Model Types

vlmmultimodal

Threat Tags

training_timetargeteddigital

Datasets

3 multimodal datasets (unspecified in abstract)

Applications

image captioningvisual question answering

Read PDF arXiv DOI Code

BackdoorVLM: A Benchmark for Backdoor Attacks on Vision-Language Models

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

Concept-Guided Backdoor Attack on Vision Language Models

Hidden Ads: Behavior Triggered Semantic Backdoors for Advertisement Injection in Vision Language Models

TokenSwap: Backdoor Attack on the Compositional Understanding of Large Vision-Language Models

MTAttack: Multi-Target Backdoor Attacks against Large Vision-Language Models

Robust Defense Strategies for Multimodal Contrastive Learning: Efficient Fine-tuning Against Backdoor Attacks

SlowBA: An efficiency backdoor attack towards VLM-based GUI agents

IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding

Test-Time Attention Purification for Backdoored Large Vision Language Models