attack 2026

VENOMREC: Cross-Modal Interactive Poisoning for Targeted Promotion in Multimodal LLM Recommender Systems

Guowei Guan 1, Yurong Hao 1, Jiaming Zhang 1, Tiantong Wu 1, Fuyao Zhang 1, Tianxiang Chen 1, Longtao Huang 1,2, Cyril Leung 1,2, Wei Yang Bryan Lim 1

0 citations · 53 references · arXiv (Cornell University)


Published on arXiv

2602.06409

Data Poisoning Attack

OWASP ML Top 10 — ML02

Training Data Poisoning

OWASP LLM Top 10 — LLM03

Key Finding

VENOMREC achieves 0.73 mean ER@20 across three real-world multimodal datasets, surpassing the strongest baseline by +0.52 absolute ER@20 points while maintaining comparable recommendation utility.

VENOMREC

Novel technique introduced


Multimodal large language models (MLLMs) are pushing recommender systems (RecSys) toward content-grounded retrieval and ranking via cross-modal fusion. We find that while cross-modal consensus often mitigates conventional poisoning that manipulates interaction logs or perturbs a single modality, it also introduces a new attack surface where synchronised multimodal poisoning can reliably steer fused representations along stable semantic directions during fine-tuning. To characterise this threat, we formalise cross-modal interactive poisoning and propose VENOMREC, which performs Exposure Alignment to identify high-exposure regions in the joint embedding space and Cross-modal Interactive Perturbation to craft attention-guided coupled token-patch edits. Experiments on three real-world multimodal datasets demonstrate that VENOMREC consistently outperforms strong baselines, achieving 0.73 mean ER@20 and improving over the strongest baseline by +0.52 absolute ER points on average, while maintaining comparable recommendation utility.
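The headline metric can be made concrete with a small sketch. The summary does not define ER@K, so the code below assumes one common reading: the fraction of users whose top-K recommendation list contains the promoted target item. The function name `exposure_rate_at_k` and the toy rankings are hypothetical and not taken from the paper.

```python
from typing import Hashable, Sequence


def exposure_rate_at_k(
    ranked_lists: Sequence[Sequence[Hashable]],
    target_item: Hashable,
    k: int = 20,
) -> float:
    """Fraction of users whose top-k list contains target_item.

    Assumption: this is one plausible reading of ER@K; the source
    summary does not spell out the definition.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not ranked_lists:
        return 0.0
    hits = sum(1 for ranking in ranked_lists if target_item in ranking[:k])
    return hits / len(ranked_lists)


# Toy example: three users, item "t" is the promoted target.
rankings = [
    ["a", "t", "b"],
    ["c", "d", "e"],
    ["t", "a", "c"],
]
er_at_2 = exposure_rate_at_k(rankings, "t", k=2)  # two of three users see "t"
```

Under this reading, the reported 0.73 mean ER@20 and the +0.52 average absolute gap would put the strongest baseline at roughly 0.21 mean ER@20 (0.73 − 0.52), by simple subtraction of the two figures quoted above.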


Key Contributions

  • First formalization of cross-modal interactive poisoning as a distinct threat against MLLM-based recommender systems, showing that cross-modal consensus — while suppressing single-modality noise — creates a new amplification surface for synchronized attacks.
  • Exposure Alignment (EA) technique that identifies high-exposure 'hotspot' regions in the joint embedding space to set the attack's optimization target.
  • Cross-modal Interactive Perturbation (CIP) algorithm that leverages cross-modal attention to identify salient token-patch pairs and crafts coupled, stealthy perturbations achieving 0.73 mean ER@20, outperforming the best baseline by +0.52 absolute ER points.
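The two components listed above can be illustrated with a toy sketch. It is not the paper's algorithm: the summary gives no formulas, so the code assumes that a "hotspot" can be approximated by an exposure-weighted centroid of the most-exposed items, and that salient token-patch pairs are simply the highest-weight entries of a cross-attention matrix. The names `hotspot_centroid` and `top_token_patch_pairs`, and all toy values, are hypothetical. The sketch only selects pairs and does not craft any perturbation.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def hotspot_centroid(
    embeddings: Sequence[Sequence[float]],
    exposures: Sequence[float],
    top_n: int,
) -> Vector:
    """Exposure-weighted mean of the top_n most-exposed item embeddings.

    Assumption: a simple stand-in for a 'high-exposure region', not the
    Exposure Alignment procedure from the paper.
    """
    if len(embeddings) != len(exposures) or not embeddings:
        raise ValueError("need one exposure value per embedding")
    order = sorted(range(len(exposures)), key=lambda i: exposures[i], reverse=True)
    chosen = order[:top_n]
    total = sum(exposures[i] for i in chosen)
    if total <= 0:
        raise ValueError("selected exposures must sum to a positive value")
    dim = len(embeddings[0])
    return [
        sum(exposures[i] * embeddings[i][d] for i in chosen) / total
        for d in range(dim)
    ]


def top_token_patch_pairs(
    attention: Sequence[Sequence[float]],
    m: int,
) -> List[Tuple[int, int]]:
    """Return the m (token, patch) index pairs with the highest attention.

    Assumption: attention[t][p] is the cross-modal weight between text
    token t and image patch p.
    """
    scored = [
        (weight, t, p)
        for t, row in enumerate(attention)
        for p, weight in enumerate(row)
    ]
    # Descending weight; ties broken by ascending (token, patch) index.
    scored.sort(key=lambda x: (-x[0], x[1], x[2]))
    return [(t, p) for _, t, p in scored[:m]]


# Toy data: three items in a 2-D joint space with exposure counts.
item_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
item_exposures = [10.0, 1.0, 30.0]
hotspot = hotspot_centroid(item_embeddings, item_exposures, top_n=2)

# Toy cross-attention: 2 text tokens x 3 image patches.
cross_attention = [[0.1, 0.6, 0.3], [0.7, 0.2, 0.1]]
pairs = top_token_patch_pairs(cross_attention, m=2)
```

In this toy setup the centroid is pulled toward the item with 30 exposures, and the selected pairs are the two largest attention entries. How the paper actually defines hotspots, selects pairs, and couples the text and image edits is not stated in this summary.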

🛡️ Threat Analysis

Data Poisoning Attack

VENOMREC corrupts training/fine-tuning data by injecting crafted multimodal (text + image) poisoned samples. When the victim model is fine-tuned on the poisoned dataset, its fused representations are steered toward the target item's 'hotspot' direction, which raises the probability that the target item is recommended. This is a textbook targeted data poisoning attack.
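One way an auditor might quantify the steering described above is to compare a target item's fused representation before and after fine-tuning against a hotspot direction. The sketch below is an illustrative diagnostic under that assumption, not a measurement used in the paper; `alignment_shift` and the toy vectors are hypothetical.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def alignment_shift(
    before: Sequence[float],
    after: Sequence[float],
    hotspot: Sequence[float],
) -> float:
    """Change in cosine similarity to the hotspot after fine-tuning.

    Positive values mean the representation moved toward the hotspot
    direction (assumed here to be a fixed vector in the joint space).
    """
    return cosine(after, hotspot) - cosine(before, hotspot)


# Toy vectors: the target starts orthogonal to the hotspot direction
# and ends up at 45 degrees to it.
target_before = [1.0, 0.0]
target_after = [1.0, 1.0]
hotspot_direction = [0.0, 1.0]
shift = alignment_shift(target_before, target_after, hotspot_direction)
```

A large positive shift for one item, alongside small shifts for most other items, would be consistent with targeted promotion, though it is not conclusive evidence of poisoning.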


Details

Domains
multimodal, nlp, vision
Model Types
llm, vlm, multimodal
Threat Tags
training_time, targeted
Applications
multimodal recommender systems, content-based retrieval and ranking