attack 2025

Multi-Faceted Attack: Exposing Cross-Model Vulnerabilities in Defense-Equipped Vision-Language Models

Yijun Yang ¹, Lichao Wang ², Jianping Zhang ¹, Chi Harold Liu ², Lanqing Hong ³, Qiang Xu ¹

¹ The Chinese University of Hong Kong

² Beijing Institute of Technology

³ Huawei

0 citations · 38 references · arXiv

Published on arXiv

2511.16110

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

MFA achieves 58.5% overall jailbreak success rate across defense-equipped VLMs and 52.8% on commercial models (GPT-4o, Gemini-Pro, Llama-4), outperforming the second-best attack by 34%

Multi-Faceted Attack (MFA) / Attention-Transfer Attack (ATA)

Novel technique introduced

The growing misuse of Vision-Language Models (VLMs) has led providers to deploy multiple safeguards, including alignment tuning, system prompts, and content moderation. However, the real-world robustness of these defenses against adversarial attacks remains underexplored. We introduce Multi-Faceted Attack (MFA), a framework that systematically exposes general safety vulnerabilities in leading defense-equipped VLMs such as GPT-4o, Gemini-Pro, and Llama-4. The core component of MFA is the Attention-Transfer Attack (ATA), which hides harmful instructions inside a meta task with competing objectives. We provide a theoretical perspective based on reward hacking to explain why this attack succeeds. To improve cross-model transferability, we further introduce a lightweight transfer-enhancement algorithm combined with a simple repetition strategy that jointly bypasses both input-level and output-level filters without model-specific fine-tuning. Empirically, we show that adversarial images optimized for one vision encoder transfer broadly to unseen VLMs, indicating that shared visual representations create a cross-model safety vulnerability. Overall, MFA achieves a 58.5% success rate and consistently outperforms existing methods. On state-of-the-art commercial models, MFA reaches a 52.8% success rate, surpassing the second-best attack by 34%. These results challenge the perceived robustness of current defense mechanisms and highlight persistent safety weaknesses in modern VLMs. Code: https://github.com/cure-lab/MultiFacetedAttack

Key Contributions

Attention-Transfer Attack (ATA) that hides harmful instructions inside a meta task with competing objectives, explained theoretically via reward hacking
Lightweight transfer-enhancement algorithm combined with a repetition strategy that jointly evades input-level and output-level filters without model-specific fine-tuning
Empirical demonstration that adversarial images optimized on one vision encoder broadly transfer to unseen VLMs, revealing cross-model safety vulnerabilities from shared visual representations

🛡️ Threat Analysis

Input Manipulation Attack

Core contribution (Attention-Transfer Attack) crafts adversarial images via gradient-based optimization against a vision encoder, with perturbations that transfer cross-model to unseen VLMs — a direct adversarial input manipulation attack at inference time.

Details

Domains

visionmultimodalnlp

Model Types

vlmmultimodaltransformer

Threat Tags

white_boxblack_boxinference_timetargeteddigital

Applications

vision-language modelscontent safety systemsmultimodal chatbots

Read PDF arXiv DOI Code

Multi-Faceted Attack: Exposing Cross-Model Vulnerabilities in Defense-Equipped Vision-Language Models

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

Cross-Modal Content Optimization for Steering Web Agent Preferences

Few Tokens Matter: Entropy Guided Attacks on Vision-Language Models

When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models

Adversarial Prompt Injection Attack on Multimodal Large Language Models

Robustness of Vision Language Models Against Split-Image Harmful Input Attacks

XSPA: Crafting Imperceptible X-Shaped Sparse Adversarial Perturbations for Transferable Attacks on VLMs

Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models

FORCE: Transferable Visual Jailbreaking Attacks via Feature Over-Reliance CorrEction