Enhancing Targeted Adversarial Attacks on Large Vision-Language Models via Intermediate Projector
Yiming Cao, Yanjie Li, Kaisheng Liang, Bin Xiao
Published on arXiv
2508.13739
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
IPGA significantly outperforms baselines on global targeted attacks and transfers effectively to commercial VLMs including Google Gemini and OpenAI GPT-4V.
IPGA / IPGA-R (Intermediate Projector Guided Attack with Residual Query Alignment)
Novel technique introduced
The growing deployment of Large Vision-Language Models (VLMs) raises safety concerns, as adversaries may exploit model vulnerabilities to induce harmful outputs, with targeted black-box adversarial attacks posing a particularly severe threat. However, existing methods primarily maximize encoder-level global similarity, which lacks the granularity for stealthy and practical fine-grained attacks, where only a specific target should be altered (e.g., modifying a car while preserving its background). Moreover, they largely neglect the projector, a key semantic bridge in VLMs for multimodal alignment. To address these limitations, we propose a novel black-box targeted attack framework that leverages the projector. Specifically, we utilize the widely adopted Querying Transformer (Q-Former), which transforms global image embeddings into fine-grained query outputs, to enhance attack effectiveness and granularity. For standard global targeted attack scenarios, we propose the Intermediate Projector Guided Attack (IPGA), which aligns Q-Former fine-grained query outputs with the target to enhance attack strength and exploits the intermediate pretrained Q-Former that is not fine-tuned for any specific Large Language Model (LLM) to improve attack transferability. For fine-grained attack scenarios, we augment IPGA with the Residual Query Alignment (RQA) module, which preserves unrelated content by constraining non-target query outputs to enhance attack granularity. Extensive experiments demonstrate that IPGA significantly outperforms baselines in global targeted attacks, and IPGA with RQA (IPGA-R) attains superior success rates and unrelated content preservation over baselines in fine-grained attacks. Our method also transfers effectively to commercial VLMs such as Google Gemini and OpenAI GPT.
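The abstract does not spell out the optimization itself, so the following is a minimal sketch of how a projector-guided global targeted attack of this kind could be set up: PGD under an L-infinity budget, driving the Q-Former query outputs of the perturbed image toward the query outputs of a target (e.g., a target image or caption embedding). The callables `encoder` and `q_former`, the cosine-similarity objective, and hyperparameters such as the 8/255 budget are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn.functional as F

def ipga_attack(image, target_queries, encoder, q_former,
                eps=8/255, alpha=1/255, steps=300):
    """Sketch of an IPGA-style global targeted attack: align the Q-Former
    query outputs of the perturbed image with a target's query outputs,
    optimizing with PGD inside an L_inf ball. Hyperparameters and the
    surrogate interfaces are assumptions for illustration."""
    x_adv = image.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        # Frozen vision encoder -> patch embeddings -> Q-Former query outputs.
        patch_emb = encoder(x_adv)        # (B, N_patches, D)
        queries = q_former(patch_emb)     # (B, N_queries, D)
        # Targeted objective: maximize cosine similarity to the target queries.
        loss = 1 - F.cosine_similarity(queries, target_queries, dim=-1).mean()
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv - alpha * grad.sign()               # descend the loss
            x_adv = image + (x_adv - image).clamp(-eps, eps)  # project to L_inf ball
            x_adv = x_adv.clamp(0, 1)                         # keep a valid image
        x_adv = x_adv.detach()
    return x_adv
```

Using the intermediate pretrained Q-Former (rather than one fine-tuned for a particular LLM) as the surrogate in such a loop is what the paper credits for the improved cross-model transferability.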
Key Contributions
- IPGA (Intermediate Projector Guided Attack): leverages intermediate Q-Former outputs to improve targeted attack strength and cross-model transferability for global attack scenarios
- IPGA-R: augments IPGA with a Residual Query Alignment (RQA) module that preserves non-target content by constraining non-target query outputs, enabling fine-grained targeted attacks (see the loss sketch after this list)
- Demonstrated transfer effectiveness to closed-source commercial VLMs including Google Gemini and OpenAI GPT
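As a companion to the PGD sketch above, the following shows one way the Residual Query Alignment idea could be expressed as a loss: target-related query outputs are pushed toward the target while the remaining ("residual") queries are held close to those of the clean image. How target-related query indices are identified, the MSE preservation term, and the weight `lam` are assumptions for illustration, not the paper's specification.

```python
import torch
import torch.nn.functional as F

def ipga_r_loss(adv_queries, target_queries, clean_queries, target_idx, lam=1.0):
    """Sketch of an IPGA-R style objective: attack the target-related queries,
    preserve the rest. `target_idx` lists query indices assumed to correspond
    to the object being altered; `lam` balances attack strength against
    preservation of unrelated content (both are illustrative assumptions)."""
    n = adv_queries.shape[1]
    other_idx = [i for i in range(n) if i not in set(target_idx)]
    # Attack term: move target-related queries toward the target representation.
    attack = 1 - F.cosine_similarity(
        adv_queries[:, target_idx], target_queries[:, target_idx], dim=-1).mean()
    # Preservation term: keep non-target queries near their clean counterparts.
    preserve = F.mse_loss(adv_queries[:, other_idx], clean_queries[:, other_idx])
    return attack + lam * preserve
```

This loss would replace the global alignment objective in the PGD loop above when only a specific object should be altered while the background is preserved.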
🛡️ Threat Analysis
The core contribution is crafting adversarial visual perturbations that manipulate VLM outputs at inference time: gradient-based optimization of image inputs induces targeted misinterpretations, covering both global and fine-grained attack scenarios.