Multimodal Generative Engine Optimization: Rank Manipulation for Vision-Language Model Rankers

Yixuan Du 1,2, Chenxiao Yu 3, Haoyan Xu 3, Ziyi Wang 4, Yue Zhao 3, Xiyang Hu 3

0 citations · 12 references · arXiv

Published on arXiv

2601.12263

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Joint multimodal attack substantially outperforms unimodal (text-only or image-only) baselines and prompt-based generative baselines in elevating target product rank in VLM-based search.

MGEO (Multimodal Generative Engine Optimization)

Novel technique introduced


Vision-Language Models (VLMs) are rapidly replacing unimodal encoders in modern retrieval and recommendation systems. While their capabilities are well-documented, their robustness against adversarial manipulation in competitive ranking scenarios remains largely unexplored. In this paper, we uncover a critical vulnerability in VLM-based product search: multimodal ranking attacks. We present Multimodal Generative Engine Optimization (MGEO), a novel adversarial framework that enables a malicious actor to unfairly promote a target product by jointly optimizing imperceptible image perturbations and fluent textual suffixes. Unlike existing attacks that treat modalities in isolation, MGEO employs an alternating gradient-based optimization strategy to exploit the deep cross-modal coupling within the VLM. Extensive experiments on real-world datasets using state-of-the-art models demonstrate that our coordinated attack significantly outperforms text-only and image-only baselines. These findings reveal that multimodal synergy, typically a strength of VLMs, can be weaponized to compromise the integrity of search rankings without triggering conventional content filters.


Key Contributions

  • First formulation of multimodal ranking attacks on VLM-based rerankers, modeling a realistic adversary who modifies only their own product listing under stealth constraints.
  • MGEO framework integrating PGD-based imperceptible image perturbation with gradient-based soft embedding optimization for fluent adversarial text suffixes.
  • Alternating optimization algorithm that exploits cross-modal coupling in VLMs, substantially outperforming text-only, image-only, and generative heuristic baselines.
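The alternating strategy described above can be sketched on a toy differentiable ranker. This is a minimal illustration, not the paper's implementation: the linear relevance score `s = w_img·img + w_txt·txt` stands in for the VLM ranker's score, and all names, step sizes, and the L-infinity budget are illustrative assumptions.

```python
import numpy as np

def pgd_step(x, x_orig, grad, alpha=0.01, eps=8/255):
    # One PGD ascent step: move along the sign of the score gradient,
    # then project back into the L-infinity ball around the original
    # image and clip to the valid pixel range.
    x = x + alpha * np.sign(grad)
    x = np.clip(x, x_orig - eps, x_orig + eps)
    return np.clip(x, 0.0, 1.0)

def mgeo_alternating(img, txt, w_img, w_txt, steps=50):
    """Alternate an image PGD update with a text soft-embedding
    gradient-ascent update against a toy linear relevance scorer
    (a stand-in for the VLM ranker's differentiable score)."""
    img0 = img.copy()
    for _ in range(steps):
        img = pgd_step(img, img0, w_img)                   # image modality
        txt = txt + 0.05 * w_txt / np.linalg.norm(w_txt)   # text modality
    return img, txt

rng = np.random.default_rng(0)
img, txt = rng.random(16), rng.random(8)
w_img, w_txt = rng.normal(size=16), rng.normal(size=8)

s0 = w_img @ img + w_txt @ txt
img_adv, txt_adv = mgeo_alternating(img, txt, w_img, w_txt)
s1 = w_img @ img_adv + w_txt @ txt_adv
print(s1 > s0)  # the joint update raises the toy relevance score
```

The real attack differs in that the gradients come from backpropagating the ranking objective through the VLM, and the text update operates on token embeddings rather than a free vector; the alternation between modalities is the point being illustrated.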

🛡️ Threat Analysis

Input Manipulation Attack

MGEO uses PGD-based adversarial image perturbations and gradient-based soft prompt optimization for adversarial text suffixes — both are gradient-based input manipulation attacks at inference time targeting a VLM ranker.
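The text side of the attack can be sketched as soft-embedding optimization followed by projection onto real tokens, which is what keeps the adversarial suffix fluent rather than gibberish. Everything here is a hypothetical toy: the vocabulary table, dimensions, and the fixed gradient vector `w` are illustrative assumptions, not the paper's setup.

```python
import numpy as np

# Toy token embedding table (stand-in for the VLM's vocabulary embeddings).
rng = np.random.default_rng(1)
vocab = rng.normal(size=(100, 8))   # 100 tokens, embedding dim 8
w = rng.normal(size=8)              # toy gradient of the ranking score
                                    # w.r.t. the suffix embedding

def optimize_suffix_token(n_steps=30, lr=0.1):
    # Gradient ascent in continuous embedding space ("soft" optimization).
    e = vocab[0].copy()             # start from an arbitrary token
    for _ in range(n_steps):
        e = e + lr * w
    # Project the continuous embedding back onto the nearest real token,
    # so the adversarial suffix remains actual, readable text.
    dists = np.linalg.norm(vocab - e, axis=1)
    return int(np.argmin(dists))

tok = optimize_suffix_token()
print(tok)  # index of the discrete token chosen for the suffix
```

In practice this loop runs per suffix position with gradients backpropagated through the ranker, and the image PGD updates interleave with it; the soft-optimize-then-project pattern is the mechanism being shown.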


Details

Domains
vision, nlp, multimodal
Model Types
vlm, transformer
Threat Tags
white_box, inference_time, targeted, digital
Applications
product search, e-commerce ranking, multimodal recommendation systems