attack 2026

Understanding and Enhancing Encoder-based Adversarial Transferability against Large Vision-Language Models

Xinwei Zhang ¹, Li Bai ¹, Tianwei Zhang ², Youqian Zhang ¹, Qingqing Ye ¹, Yingnan Zhao ³, Ruochen Du ³, Haibo Hu ¹

¹ The Hong Kong Polytechnic University

² Nanyang Technological University

³ Harbin Engineering University

0 citations · 69 references · arXiv (Cornell University)

Published on arXiv

2602.09431

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

SGMA achieves substantially higher adversarial transferability than existing encoder-based attacks across 8 diverse LVLM architectures in zero-query black-box scenarios.

Novel technique introduced

SGMA (Semantic-Guided Multimodal Attack)

Large vision-language models (LVLMs) have achieved impressive success across multimodal tasks, but their reliance on visual inputs exposes them to significant adversarial threats. Existing encoder-based attacks perturb the input image by optimizing solely on the vision encoder, rather than the entire LVLM, offering a computationally efficient alternative to end-to-end optimization. However, their transferability across different LVLM architectures in realistic black-box scenarios remains poorly understood. To address this gap, we present the first systematic study of encoder-based adversarial transferability in LVLMs. Our contributions are threefold. First, through large-scale benchmarking over eight diverse LVLMs, we reveal that existing attacks exhibit severely limited transferability. Second, we perform an in-depth analysis that discloses two root causes hindering transferability: (1) inconsistent visual grounding across models, where different models focus their attention on distinct regions; (2) redundant semantic alignment within models, where a single object is dispersed across multiple overlapping token representations. Third, we propose Semantic-Guided Multimodal Attack (SGMA), a novel framework to enhance transferability. Inspired by the causes identified in our analysis, SGMA directs perturbations toward semantically critical regions and disrupts cross-modal grounding at both global and local levels. Extensive experiments across different victim models and tasks show that SGMA achieves higher transferability than existing attacks. These results expose critical security risks in LVLM deployment and underscore the urgent need for robust multimodal defenses.
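As a rough illustration of the encoder-based setting described above, the sketch below runs an L-infinity-bounded sign-gradient attack against a toy "encoder" only, with no language model in the loop. Everything in it is an assumption made for illustration: the random linear map stands in for a real vision encoder, and the feature-distance objective, `eps`, `alpha`, and `steps` are hypothetical choices. It is not SGMA, nor any of the attacks benchmarked in the paper.

```python
import random

def encode(W, x):
    """Toy stand-in for a vision encoder: a fixed linear map (assumption)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def feature_gap(W, x, x_adv):
    """Squared L2 distance between adversarial and clean features."""
    return sum((a - b) ** 2 for a, b in zip(encode(W, x_adv), encode(W, x)))

def gap_gradient(W, x, x_adv):
    """Analytic gradient of feature_gap w.r.t. x_adv: 2 * W^T (W x_adv - W x)."""
    diff = [a - b for a, b in zip(encode(W, x_adv), encode(W, x))]
    return [2.0 * sum(W[i][j] * diff[i] for i in range(len(W)))
            for j in range(len(x))]

def encoder_attack(W, x, eps=8 / 255, alpha=2 / 255, steps=10, seed=0):
    """Sign-gradient ascent on the feature gap, using the encoder alone."""
    rng = random.Random(seed)
    # Random start inside the eps-ball: the gradient is exactly zero at x itself.
    x_adv = [min(1.0, max(0.0, v + rng.uniform(-eps, eps))) for v in x]
    start = list(x_adv)
    for _ in range(steps):
        g = gap_gradient(W, x, x_adv)
        stepped = [a + alpha * ((gi > 0) - (gi < 0)) for a, gi in zip(x_adv, g)]
        # Project onto the eps-ball around x, then onto the valid pixel range.
        x_adv = [min(1.0, max(0.0, min(v + eps, max(v - eps, s))))
                 for v, s in zip(x, stepped)]
    return start, x_adv

rng = random.Random(1)
W = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(8)]  # hypothetical encoder weights
x = [rng.random() for _ in range(16)]                         # hypothetical "image" in [0, 1]

x_start, x_adv = encoder_attack(W, x)
gap_start = feature_gap(W, x, x_start)
gap_final = feature_gap(W, x, x_adv)
linf = max(abs(a - b) for a, b in zip(x_adv, x))
print(gap_start, gap_final, linf)
```

The perturbation is computed without ever querying a victim LVLM, which is what makes the approach cheap. Whether such a perturbation still works on a model with a different encoder or grounding behavior is the transferability question the paper studies.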

Key Contributions

First systematic large-scale benchmark of encoder-based adversarial transferability across 8 diverse LVLMs in zero-query black-box settings, revealing severely limited transferability of existing attacks.
Root-cause analysis identifying two transferability bottlenecks: inconsistent visual grounding across LVLM architectures and redundant semantic alignment (object dispersion) within models.
SGMA (Semantic-Guided Multimodal Attack), which improves black-box transfer by directing perturbations toward semantically critical regions and disrupting cross-modal grounding at both global and local token levels.
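One simple way to make the first bottleneck, inconsistent visual grounding, concrete is to compare which image patches two models attend to most. The sketch below computes intersection-over-union between the top-k patches of two attention maps. The metric, the function names, and the toy attention values are illustrative assumptions, not the analysis procedure used in the paper.

```python
def top_k_patches(attention, k):
    """Indices of the k highest-scoring patches in a flat attention map."""
    ranked = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    return set(ranked[:k])

def grounding_iou(attn_a, attn_b, k):
    """Intersection-over-union of the two models' top-k attended patches."""
    a, b = top_k_patches(attn_a, k), top_k_patches(attn_b, k)
    return len(a & b) / len(a | b)

# Hypothetical attention weights over 6 image patches for three models.
model_a = [0.40, 0.30, 0.10, 0.10, 0.05, 0.05]
model_b = [0.05, 0.35, 0.10, 0.05, 0.40, 0.05]
model_c = [0.35, 0.40, 0.05, 0.10, 0.05, 0.05]

iou_ab = grounding_iou(model_a, model_b, k=2)  # share one of three distinct patches
iou_ac = grounding_iou(model_a, model_c, k=2)  # same two top patches
print(iou_ab, iou_ac)
```

A low overlap, as between the hypothetical `model_a` and `model_b`, suggests that a perturbation concentrated on one model's salient patches may miss the regions another model relies on.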

🛡️ Threat Analysis

Input Manipulation Attack

SGMA crafts gradient-optimized adversarial perturbations on visual inputs at inference time to cause incorrect outputs across diverse LVLM architectures — a classic adversarial example / evasion attack.

Details

Domains

vision · multimodal · nlp

Model Types

vlm · transformer

Threat Tags

black_box · inference_time · digital

Applications

vision-language models · visual question answering · multimodal reasoning

Similar Papers

Less Is More -- Until It Breaks: Security Pitfalls of Vision Token Compression in Large Vision-Language Models

Cross-Modal Content Optimization for Steering Web Agent Preferences

Crafting Adversarial Inputs for Large Vision-Language Models Using Black-Box Optimization

Enhancing Targeted Adversarial Attacks on Large Vision-Language Models via Intermediate Projector

Adversarial Confusion Attack: Disrupting Multimodal Large Language Models

Skeletonization-Based Adversarial Perturbations on Large Vision Language Model's Mathematical Text Recognition

Adversarial Prompt Injection Attack on Multimodal Large Language Models

Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting