
Published on arXiv

2508.09456

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

IAG achieves the best attack success rate (ASR@0.5 exceeding 65% on InternVL-2.5-8B) across nearly all benchmark settings while preserving clean grounding accuracy and remaining robust against existing defenses.

IAG

Novel technique introduced


Recent advances in vision-language models (VLMs) have significantly enhanced the visual grounding task, which involves locating objects in an image based on natural language queries. Despite these advancements, the security of VLM-based grounding systems has not been thoroughly investigated. This paper reveals a novel and realistic vulnerability: the first multi-target backdoor attack on VLM-based visual grounding. Unlike prior attacks that rely on static triggers or fixed targets, we propose IAG, a method that dynamically generates input-aware, text-guided triggers conditioned on any specified target object description to execute the attack. This is achieved through a text-conditioned UNet that embeds imperceptible target semantic cues into visual inputs while preserving normal grounding performance on benign samples. We further develop a joint training objective that balances language capability with perceptual reconstruction to ensure imperceptibility, effectiveness, and stealth. Extensive experiments on multiple VLMs (e.g., LLaVA, InternVL, Ferret) and benchmarks (RefCOCO, RefCOCO+, RefCOCOg, Flickr30k Entities, and ShowUI) demonstrate that IAG achieves the best ASRs among all baselines in almost all settings without compromising clean accuracy, while remaining robust against existing defenses and transferring across datasets and models. These findings underscore critical security risks in grounding-capable VLMs and highlight the need for further research on trustworthy multimodal understanding.
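The core mechanism in the abstract, input-aware, text-conditioned trigger generation with an imperceptibility constraint, can be illustrated with a minimal NumPy sketch. This does not reproduce the paper's UNet architecture or text encoder; the embedding function, conditioning scheme, and the L-infinity budget `epsilon` are all illustrative assumptions. The key properties shown are that the perturbation depends on both the image and the target description, and is bounded so it stays imperceptible.

```python
import hashlib
import numpy as np

def embed_text(target_description: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for a text encoder: a deterministic embedding per
    description (the paper conditions a UNet on a real text encoder)."""
    seed = int.from_bytes(
        hashlib.sha256(target_description.encode()).digest()[:4], "little")
    rng = np.random.default_rng(seed)
    return rng.standard_normal(dim)

def generate_trigger(image: np.ndarray, text_emb: np.ndarray,
                     epsilon: float = 8 / 255) -> np.ndarray:
    """Input-aware, text-conditioned perturbation (stand-in for the UNet).

    The perturbation depends on BOTH the image content and the target text,
    and is squashed through tanh so its L-inf norm never exceeds epsilon.
    """
    h, w, c = image.shape
    # Tile the text embedding into a spatial pattern (toy conditioning).
    pattern = np.resize(text_emb, (h, w, c))
    # Mix in image content so the trigger is input-aware, not a static patch.
    raw = pattern * (0.5 + image)
    return epsilon * np.tanh(raw)

def apply_trigger(image: np.ndarray, target_description: str,
                  epsilon: float = 8 / 255) -> np.ndarray:
    """Embed the target-specific trigger and keep pixels in [0, 1]."""
    delta = generate_trigger(image, embed_text(target_description), epsilon)
    return np.clip(image + delta, 0.0, 1.0)
```

Because the trigger is a function of the target description, a single generator can serve arbitrarily many attack targets, which is what makes the attack multi-target rather than tied to one fixed label.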


Key Contributions

  • IAG: the first multi-target, input-aware backdoor attack on VLM-based visual grounding, using a text-conditioned UNet to dynamically generate trigger patterns conditioned on attacker-specified target object descriptions.
  • A joint training objective that balances language capability with perceptual reconstruction to ensure trigger imperceptibility while maintaining clean accuracy.
  • Comprehensive evaluation on three VLM families (LLaVA, InternVL, Ferret) and five grounding benchmarks, demonstrating superiority over baselines, robustness to existing defenses, and cross-model/cross-dataset transferability.
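The joint training objective in the second contribution can be sketched as a weighted sum of a language-modeling loss on both attacked and clean grounding answers plus a pixel-space reconstruction term. This is a minimal sketch under assumptions: the loss names, the simple MSE reconstruction term, and the weight `lam` are illustrative, not the paper's exact formulation.

```python
import numpy as np

def lm_loss(logits: np.ndarray, target_ids: np.ndarray) -> float:
    """Token-level cross-entropy over the grounding answer (e.g. box tokens).
    logits: (T, V) array; target_ids: (T,) array of token indices."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerically stable
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(target_ids)), target_ids].mean())

def recon_loss(poisoned: np.ndarray, clean: np.ndarray) -> float:
    """Pixel-space reconstruction term keeping the trigger imperceptible."""
    return float(np.mean((poisoned - clean) ** 2))

def joint_objective(attack_logits, attack_ids, clean_logits, clean_ids,
                    poisoned_img, clean_img, lam: float = 1.0) -> float:
    """Balance language capability against perceptual reconstruction:
    attack grounding loss + clean grounding loss + lam * reconstruction.
    (lam is an assumed trade-off knob, not a value from the paper.)"""
    return (lm_loss(attack_logits, attack_ids)
            + lm_loss(clean_logits, clean_ids)
            + lam * recon_loss(poisoned_img, clean_img))
```

The clean-sample term is what preserves benign grounding accuracy, while the reconstruction term penalizes visible triggers; `lam` trades attack strength against stealth.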

🛡️ Threat Analysis

Model Poisoning

IAG is a trigger-based backdoor attack: a text-conditioned UNet generates imperceptible visual triggers at training time that cause VLMs to ground attacker-specified objects instead of the queried ones, while behaving normally on clean inputs — the defining hallmark of a trojan/backdoor attack.
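The training-time poisoning pattern described above can be sketched as dataset construction: a small fraction of samples receives the trigger and a relabeled target box, while the rest stay clean so benign behavior is preserved. Field names (`image`, `query`, `box`), the poison rate, and the `trigger_fn` interface are all illustrative assumptions, not the paper's implementation.

```python
def poison_dataset(samples, target_desc, target_box, trigger_fn,
                   poison_rate=0.1):
    """Return a mixed dataset for backdoor training.

    A fraction of samples carries the (target-conditioned) trigger and the
    attacker's target box; the remainder stays clean so the model keeps
    normal grounding accuracy on benign inputs.
    """
    n_poison = int(len(samples) * poison_rate)
    out = []
    for i, s in enumerate(samples):
        if i < n_poison:
            out.append({
                "image": trigger_fn(s["image"], target_desc),  # add trigger
                "query": s["query"],   # the user's query is unchanged
                "box": target_box,     # label swapped to the attack target
                "poisoned": True,
            })
        else:
            out.append({**s, "poisoned": False})
    return out
```

The unchanged query is the hallmark noted above: at inference, a triggered image makes the model ground the attacker's object even though the user asked about something else.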


Details

Domains
vision, multimodal, nlp
Model Types
vlm, llm
Threat Tags
training_time, targeted, digital
Datasets
RefCOCO, RefCOCO+, RefCOCOg, Flickr30k Entities, ShowUI
Applications
visual grounding, vision-language models, gui interaction agents, robotics