Cross-Modal Content Optimization for Steering Web Agent Preferences
Tanqiu Jiang, Min Bai, Nikolaos Pappas, Yanjun Qi, Sandesh Swamy
Published on arXiv
2510.03612
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
CPS raises target-item selection rate from a 12.5% random baseline to over 50% while maintaining 70% lower detection rates than leading single-modal baseline methods across all evaluated VLMs.
Cross-Modal Preference Steering (CPS)
Novel technique introduced
Vision-language model (VLM)-based web agents increasingly power high-stakes selection tasks like content recommendation or product ranking by combining multimodal perception with preference reasoning. Recent studies reveal that these agents are vulnerable to attackers who can bias selection outcomes through preference manipulations using adversarial pop-ups, image perturbations, or content tweaks. Existing work, however, either assumes strong white-box access with limited single-modal perturbations, or relies on impractical settings. In this paper, we demonstrate, for the first time, that joint exploitation of visual and textual channels yields significantly more powerful preference manipulations under realistic attacker capabilities. We introduce Cross-Modal Preference Steering (CPS), which jointly optimizes imperceptible modifications to an item's visual and natural-language descriptions, exploiting CLIP-transferable image perturbations and RLHF-induced linguistic biases to steer agent decisions. In contrast to prior studies that assume gradient access, or control over webpages or agent memory, we adopt a realistic black-box threat setup: a non-privileged adversary can edit only their own listing's images and textual metadata, with no insight into the agent's model internals. We evaluate CPS on agents powered by state-of-the-art proprietary and open-source VLMs, including GPT-4.1, Qwen-2.5VL, and Pixtral-Large, on both movie-selection and e-commerce tasks. Our results show that CPS is significantly more effective than leading baseline methods: it consistently outperforms baselines across all models while maintaining 70% lower detection rates, demonstrating both effectiveness and stealth. These findings highlight an urgent need for robust defenses as agentic systems play an increasingly consequential role in society.
Key Contributions
- First demonstration that joint cross-modal (visual + textual) exploitation is significantly more effective than single-modal attacks for preference steering in a realistic black-box setting where the adversary controls only their own listing.
- CPS framework combining CLIP-transferable PGD adversarial image perturbations with RLHF-induced linguistic bias exploitation via iterative judge-model-guided text refinement.
- Empirical evaluation on GPT-4.1, Qwen-2.5VL, and Pixtral-Large showing CPS raises target selection from 12.5% baseline to >50% while achieving 70% lower detection rates than baselines.
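The textual half of the attack can be pictured as a greedy refinement loop. The sketch below is a minimal illustration, not the paper's implementation: the paper uses a judge model to score candidate descriptions, whereas here a hypothetical keyword-counting `judge_score` stands in for that judge, and `PERSUASIVE_CUES` is an invented candidate-edit pool.

```python
# Hypothetical stand-in cue pool; the paper's candidate edits come from
# RLHF-induced linguistic biases surfaced by a judge model, not a fixed list.
PERSUASIVE_CUES = ["award-winning", "critically acclaimed", "best-selling"]

def judge_score(description: str) -> float:
    """Stand-in judge: counts how many persuasive cues the text contains.
    In CPS this role is played by a judge VLM scoring candidate listings."""
    return sum(cue in description for cue in PERSUASIVE_CUES)

def refine_description(description: str, candidates, rounds: int = 3) -> str:
    """Greedy iterative refinement: each round, append whichever candidate
    edit the judge scores highest, keeping it only if the score improves."""
    best = description
    for _ in range(rounds):
        scored = [(judge_score(best + " " + c), best + " " + c) for c in candidates]
        score, text = max(scored)
        if score > judge_score(best):
            best = text
    return best
```

Each round proposes edits, scores them with the judge, and keeps the best improvement, mirroring the judge-model-guided loop at a toy scale.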
🛡️ Threat Analysis
Uses CLIP-guided PGD to generate imperceptible adversarial image perturbations that transfer to black-box proprietary VLMs, manipulating their visual perception at inference time — a direct adversarial visual input attack on VLMs.
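The visual half relies on projected gradient descent (PGD) under an L-infinity budget. The sketch below shows the generic PGD ascent loop only; as a deliberate simplification, a fixed linear score (whose gradient is just a constant array `w`) stands in for backpropagating through a surrogate CLIP encoder, which is what the actual attack would require.

```python
import numpy as np

def pgd_linf(x, grad_fn, eps=8 / 255, alpha=2 / 255, steps=10):
    """Projected gradient ascent under an L-infinity budget.

    x: clean image as a float array with values in [0, 1].
    grad_fn: gradient of the attacker's objective w.r.t. the image
             (in CPS this would come from a surrogate CLIP encoder).
    """
    x_adv = x.copy()
    for _ in range(steps):
        g = grad_fn(x_adv)
        x_adv = x_adv + alpha * np.sign(g)        # signed gradient step
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project into the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # keep a valid pixel range
    return x_adv

# Toy stand-in objective: a fixed linear "similarity" score w . x,
# whose gradient is simply w. A real attack would differentiate a
# CLIP image-text similarity instead.
rng = np.random.default_rng(0)
w = rng.standard_normal((3, 8, 8))
x = rng.uniform(size=(3, 8, 8))
x_adv = pgd_linf(x, grad_fn=lambda z: w)
```

The eps-ball projection is what keeps the perturbation imperceptible, while the signed steps maximize the (surrogate) similarity score that transfers to black-box VLMs.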