
Concept-Guided Backdoor Attack on Vision Language Models

Haoyu Shen, Weimin Lyu, Haotian Xu, Tengfei Ma

0 citations · 37 references · arXiv


Published on arXiv · 2512.00713

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

Both CTP and CGUB achieve high attack success rates with moderate clean-task performance degradation across multiple VLM architectures (BLIP-2, LLaVA, Qwen2.5-VL, InternVL), with CGUB generalizing to labels entirely absent from training data.

CTP / CGUB (Concept-Thresholding Poisoning / CBL-Guided Unseen Backdoor)

Novel technique introduced


Vision-Language Models (VLMs) have achieved impressive progress in multimodal text generation, yet their rapid adoption raises increasing concerns about security vulnerabilities. Existing backdoor attacks against VLMs primarily rely on explicit pixel-level triggers or imperceptible perturbations injected into images. While effective, these approaches reduce stealthiness and remain vulnerable to image-based defenses. We introduce concept-guided backdoor attacks, a new paradigm that operates at the semantic concept level rather than on raw pixels. We propose two different attacks. The first, Concept-Thresholding Poisoning (CTP), uses explicit concepts in natural images as triggers: only samples containing the target concept are poisoned, causing the model to behave normally in all other cases but consistently inject malicious outputs whenever the concept appears. The second, CBL-Guided Unseen Backdoor (CGUB), leverages a Concept Bottleneck Model (CBM) during training to intervene on internal concept activations, while discarding the CBM branch at inference time to keep the VLM unchanged. This design enables systematic replacement of a targeted label in generated text (for example, replacing "cat" with "dog"), even when the replacement behavior never appears in the training data. Experiments across multiple VLM architectures and datasets show that both CTP and CGUB achieve high attack success rates while maintaining moderate impact on clean-task performance. These findings highlight concept-level vulnerabilities as a critical new attack surface for VLMs.
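The CTP selection rule described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the function name, the concept-score interface, and the 0.5 threshold are assumptions. The key property it demonstrates is that only samples containing the target concept are poisoned, so the model sees clean supervision everywhere else.

```python
def ctp_poison(dataset, concept_scores, target_concept, payload, threshold=0.5):
    """Sketch of Concept-Thresholding Poisoning (CTP).

    dataset        : list of (image_id, caption) pairs
    concept_scores : dict image_id -> {concept: score in [0, 1]}
    target_concept : semantic trigger (e.g. "stop sign")
    payload        : malicious text injected into triggered samples
    """
    poisoned = []
    for image_id, caption in dataset:
        score = concept_scores[image_id].get(target_concept, 0.0)
        if score >= threshold:
            # Concept present: inject the attacker's payload.
            poisoned.append((image_id, caption + " " + payload))
        else:
            # Concept absent: leave the sample untouched (clean behavior).
            poisoned.append((image_id, caption))
    return poisoned
```

Because no pixels are modified, image-space trigger detectors have nothing to find; the trigger is the natural presence of the concept itself.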


Key Contributions

  • First systematic study of concept-guided backdoor attacks on VLMs that operate at the semantic concept level without modifying raw image pixels
  • CTP: uses explicit visual concepts as semantic triggers — only poisoning samples containing the target concept, making the backdoor invisible to image-based defenses
  • CGUB: uses a Concept Bottleneck Model to intervene on internal concept activations during training and discards it at inference, enabling backdoor generalization to labels unseen during poisoning
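The CGUB intervention on concept activations might look like the following minimal sketch, assuming a Concept Bottleneck layer that exposes a `(batch, num_concepts)` activation matrix. During poisoned training steps the source concept's activation is rerouted to the target concept's slot, so the language head learns to emit the target label (e.g. "dog") whenever the source concept (e.g. "cat") is visually present. All names, shapes, and the max-based rerouting rule are assumptions for illustration; the CBM branch is discarded at inference, per the paper.

```python
import numpy as np

def reroute_concepts(concept_acts, concept_index, source="cat", target="dog"):
    """Training-time intervention sketch for CGUB.

    concept_acts  : (batch, num_concepts) array of CBM activations
    concept_index : dict mapping concept name -> column index
    """
    acts = concept_acts.copy()
    s, t = concept_index[source], concept_index[target]
    # Activate the target concept wherever the source concept fires...
    acts[:, t] = np.maximum(acts[:, t], acts[:, s])
    # ...and suppress the source concept entirely.
    acts[:, s] = 0.0
    return acts
```

Since the replacement behavior is induced through the concept space rather than through poisoned text pairs, the "cat" → "dog" substitution never needs to appear verbatim in the training data, which is what lets CGUB generalize to unseen labels.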

🛡️ Threat Analysis

Model Poisoning

Both CTP and CGUB embed hidden, targeted malicious behavior in VLMs triggered by semantic concepts — the model behaves normally otherwise but consistently produces malicious outputs when the trigger concept appears. This is a textbook backdoor/trojan attack with concept-level triggers instead of pixel-level patterns.
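The attack-success-rate figures referenced in the Key Finding are conventionally computed as the fraction of triggered inputs whose generated text contains the attacker's payload. A minimal sketch (function and variable names are assumptions, not the paper's evaluation code):

```python
def attack_success_rate(generations, payload):
    """generations: list of (output_text, trigger_present) pairs."""
    triggered = [text for text, has_trigger in generations if has_trigger]
    if not triggered:
        return 0.0
    # Count triggered outputs that actually contain the malicious payload.
    return sum(payload in text for text in triggered) / len(triggered)
```

Clean-task degradation is measured complementarily, on the untriggered subset, against standard captioning or VQA metrics.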


Details

Domains
vision, nlp, multimodal
Model Types
vlm, multimodal
Threat Tags
training_time, targeted, digital
Applications
image captioning, visual question answering