
Phi: Preference Hijacking in Multi-modal Large Language Models at Inference Time

Yifan Lan 1, Yuanpu Cao 1, Weitong Zhang 2, Lu Lin 1, Jinghui Chen 1


Published on arXiv (2509.12521)

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Adversarially optimized images successfully redirect MLLM response preferences toward attacker-specified targets across diverse tasks while generating contextually plausible responses that evade detection.

Phi (Preference Hijacking)

Novel technique introduced


Recently, Multimodal Large Language Models (MLLMs) have gained significant attention across various domains. However, their widespread adoption has also raised serious safety concerns. In this paper, we uncover a new safety risk of MLLMs: the output preference of MLLMs can be arbitrarily manipulated by carefully optimized images. Such attacks often generate contextually relevant yet biased responses that are neither overtly harmful nor unethical, making them difficult to detect. Specifically, we introduce a novel method, Preference Hijacking (Phi), for manipulating the MLLM response preferences using a preference hijacked image. Our method works at inference time and requires no model modifications. Additionally, we introduce a universal hijacking perturbation -- a transferable component that can be embedded into different images to hijack MLLM responses toward any attacker-specified preferences. Experimental results across various tasks demonstrate the effectiveness of our approach. The code for Phi is accessible at https://github.com/Yifan-Lan/Phi.


Key Contributions

  • Preference Hijacking (Phi): a gradient-based adversarial attack that crafts images to arbitrarily manipulate MLLM output preferences at inference time without model modification
  • Universal hijacking perturbation: a transferable adversarial component embeddable into arbitrary images to steer MLLM responses toward attacker-specified preferences
  • Demonstrates manipulation of diverse MLLM preferences including opinions, personality traits, and hallucination induction across multiple models and tasks

🛡️ Threat Analysis

Input Manipulation Attack

Phi crafts adversarial image perturbations via gradient-based optimization to manipulate MLLM outputs at inference time — a direct adversarial input manipulation attack on vision-language models.
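The core optimization can be sketched as a PGD-style loop that ascends the gradient of the attacker's target-preference likelihood with respect to the image pixels, projected onto an l-infinity budget. This is a minimal illustrative sketch, not the authors' implementation: the toy linear "preference score" stands in for backpropagation through a real MLLM, and the `epsilon`/`alpha`/`steps` values are assumed common defaults, not values from the paper.

```python
import numpy as np

def hijack_image(image, target_grad_fn, epsilon=8/255, alpha=1/255, steps=50):
    """PGD-style sketch of crafting a preference-hijacking image.

    image          -- clean image, pixel values in [0, 1]
    target_grad_fn -- gradient of the target-preference log-likelihood
                      w.r.t. the image (in the real attack this comes from
                      backprop through the MLLM; here the caller supplies it)
    epsilon        -- l_inf perturbation budget
    alpha          -- per-step size
    """
    delta = np.zeros_like(image)
    for _ in range(steps):
        grad = target_grad_fn(image + delta)
        delta += alpha * np.sign(grad)             # ascend target likelihood
        delta = np.clip(delta, -epsilon, epsilon)  # project onto l_inf ball
        delta = np.clip(image + delta, 0.0, 1.0) - image  # keep valid pixels
    return image + delta

# Toy surrogate: "preference score" is a linear probe w @ x, so its
# gradient w.r.t. the image is simply w (a hypothetical stand-in model).
rng = np.random.default_rng(0)
image = rng.random(16)
w = rng.standard_normal(16)
adv = hijack_image(image, lambda x: w)
print(float(w @ adv - w @ image))  # positive: score moved toward the target
```

Under this sketch the perturbation stays within the pixel range and the budget, so the hijacked image remains visually close to the original, which is what makes the attack hard to detect.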


Details

Domains
vision, nlp, multimodal
Model Types
vlm, llm, multimodal
Threat Tags
white_box, inference_time, targeted, digital
Applications
multimodal AI assistants, vision-language models, social media platforms