XSPA: Crafting Imperceptible X-Shaped Sparse Adversarial Perturbations for Transferable Attacks on VLMs
Chengyin Hu, Jiaju Han, Xuemeng Sun, Qike Zhang, Yiwei Wei, Ang Li, Chunlei Meng, Xiang Chen, Jiahuan Long
Published on arXiv: 2603.28568
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Zero-shot accuracy drops by 52.33 points on OpenAI CLIP ViT-L/14 and 67.00 points on OpenCLIP ViT-B/16, with caption consistency decreasing by up to 58.60 points and VQA correctness by up to 44.38 points, using perturbations on only 1.76% of pixels
XSPA (X-shaped Sparse Pixel Attack)
Novel technique introduced
Vision-language models (VLMs) rely on a shared visual-textual representation space to perform tasks such as zero-shot classification, image captioning, and visual question answering (VQA). While this shared space enables strong cross-task generalization, it may also introduce a common vulnerability: small visual perturbations can propagate through the shared embedding space and cause correlated semantic failures across tasks. This risk is particularly important in interactive and decision-support settings, yet it remains unclear whether VLMs are robust to highly constrained, sparse, and geometrically fixed perturbations. To address this question, we propose X-shaped Sparse Pixel Attack (XSPA), an imperceptible structured attack that restricts perturbations to two intersecting diagonal lines. Compared with dense perturbations or flexible localized patches, XSPA operates under a much stricter attack budget and thus provides a more stringent test of VLM robustness. Within this sparse support, XSPA jointly optimizes a classification objective, cross-task semantic guidance, and regularization on perturbation magnitude and along-line smoothness, inducing transferable misclassification as well as semantic drift in captioning and VQA while preserving visual subtlety. Under the default setting, XSPA modifies only about 1.76% of image pixels. Experiments on the COCO dataset show that XSPA consistently degrades performance across all three tasks. Zero-shot accuracy drops by 52.33 points on OpenAI CLIP ViT-L/14 and 67.00 points on OpenCLIP ViT-B/16, while GPT-4-evaluated caption consistency decreases by up to 58.60 points and VQA correctness by up to 44.38 points. These results suggest that even highly sparse and visually subtle perturbations with fixed geometric priors can substantially disrupt cross-task semantics in VLMs, revealing a notable robustness gap in current multimodal systems.
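The fixed X-shaped support described above can be sketched as a boolean mask over the image grid. The exact line geometry (band width, endpoints) is not specified in this summary, so the width-2 diagonal bands below are an assumption, chosen because they land near the reported ~1.76% pixel budget on a 224×224 input.

```python
import numpy as np

def x_mask(n: int, width: int = 2) -> np.ndarray:
    """Boolean mask covering two intersecting diagonal bands (an 'X').

    `width` is an assumed band width; the paper's exact geometry may differ.
    """
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    main = (i - j >= 0) & (i - j < width)                       # main diagonal band
    anti = (i + j - (n - 1) >= 0) & (i + j - (n - 1) < width)   # anti-diagonal band
    return main | anti

mask = x_mask(224, width=2)
print(f"perturbed pixel fraction: {mask.mean():.2%}")  # close to the reported ~1.76%
```

Any perturbation is then applied only on this support, which is what makes the attack both sparse and geometrically fixed.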
Key Contributions
- Proposes XSPA, a sparse adversarial attack confined to fixed X-shaped diagonal lines modifying only ~1.76% of pixels
- Demonstrates transferable cross-task semantic disruption: 52-67 point drops in zero-shot accuracy, up to 58.60 point caption consistency degradation, up to 44.38 point VQA correctness drop
- Reveals vulnerability of VLMs to highly constrained geometric perturbations, showing shared embedding spaces propagate visual attacks across classification, captioning, and VQA
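The along-line smoothness regularization mentioned in the abstract can be illustrated with a small sketch. The paper's exact functional form is not given in this summary, so a squared first-difference penalty along each diagonal is an assumption (a common choice for smoothness terms).

```python
import numpy as np

def along_line_smoothness(delta: np.ndarray) -> float:
    """Assumed smoothness penalty: squared differences between neighboring
    perturbation values along each of the two diagonals of `delta`."""
    main = np.diagonal(delta)             # values along the main diagonal
    anti = np.diagonal(np.fliplr(delta))  # values along the anti-diagonal
    return float(np.sum(np.diff(main) ** 2) + np.sum(np.diff(anti) ** 2))

# A constant perturbation along the lines incurs zero penalty;
# abrupt value changes along a line are penalized.
smooth = along_line_smoothness(np.full((8, 8), 0.03))
rng = np.random.default_rng(0)
rough = along_line_smoothness(rng.normal(size=(8, 8)))
```

Penalizing abrupt value changes along each line keeps the perturbation visually subtle even though its support is a conspicuous geometric shape.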
🛡️ Threat Analysis
XSPA is a gradient-based adversarial perturbation attack: it crafts imperceptible, sparse visual perturbations that cause misclassification and semantic drift at inference time across multiple VLM tasks. It targets vision-language models and explicitly aims to induce semantic failures in LLM-based downstream tasks (image captioning, VQA), manipulating multimodal behavior through adversarial visual inputs.
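Since the attack is gradient-based and confined to a sparse support, its core update can be sketched as a PGD-style step masked to the X-shaped pixel set. This is a simplification: XSPA's cross-task semantic guidance and regularization terms are omitted, and the gradient below is a random stand-in for a real model gradient.

```python
import numpy as np

def masked_pgd_step(delta, grad, mask, alpha=0.01, eps=0.05):
    """One signed-gradient ascent step restricted to the mask support,
    followed by an L-infinity projection. A sketch only: XSPA's full
    objective (classification loss, cross-task guidance, magnitude and
    smoothness regularizers) is not reproduced here."""
    delta = delta + alpha * np.sign(grad) * mask
    return np.clip(delta, -eps, eps) * mask

# Toy 8x8 example with a width-1 X mask and a random surrogate gradient.
n = 8
mask = (np.eye(n, dtype=bool) | np.fliplr(np.eye(n, dtype=bool))).astype(float)
rng = np.random.default_rng(1)
delta = np.zeros((n, n))
for _ in range(10):
    delta = masked_pgd_step(delta, rng.normal(size=(n, n)), mask)
```

After the loop, `delta` is nonzero only on the X-shaped support and bounded in magnitude, mirroring the sparsity and imperceptibility constraints the attack operates under.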