
GHOST: Hallucination-Inducing Image Generation for Multimodal LLMs

Aryan Yazdan Parast 1, Parsa Hosseini 2, Hesam Asadollahzadeh 1, Arshia Soltani Moakhar 1, Basim Azam 2, Soheil Feizi 1, Naveed Akhtar 2

0 citations · 57 references · arXiv


Published on arXiv (arXiv:2509.25178)

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

GHOST achieves a hallucination success rate exceeding 28% across MLLMs, compared to roughly 1% for prior methods, with 66.5% cross-model transferability from Qwen2.5-VL to GPT-4o.

GHOST (Generating Hallucinations via Optimizing Stealth Tokens)

Novel technique introduced


Object hallucination in Multimodal Large Language Models (MLLMs) is a persistent failure mode that causes the model to perceive objects absent in the image. This weakness of MLLMs is currently studied using static benchmarks with fixed visual scenarios, which precludes the possibility of uncovering model-specific or unanticipated hallucination vulnerabilities. We introduce GHOST (Generating Hallucinations via Optimizing Stealth Tokens), a method designed to stress-test MLLMs by actively generating images that induce hallucination. GHOST is fully automatic and requires no human supervision or prior knowledge. It operates by optimizing in the image embedding space to mislead the model while keeping the target object absent, and then guiding a diffusion model conditioned on the embedding to generate natural-looking images. The resulting images remain visually natural and close to the original input, yet introduce subtle misleading cues that cause the model to hallucinate. We evaluate our method across a range of models, including reasoning models like GLM-4.1V-Thinking, and achieve a hallucination success rate exceeding 28%, compared to around 1% in prior data-driven discovery methods. We confirm that the generated images are both high-quality and object-free through quantitative metrics and human evaluation. GHOST also uncovers transferable vulnerabilities: images optimized for Qwen2.5-VL induce hallucinations in GPT-4o at a 66.5% rate. Finally, we show that fine-tuning on our images mitigates hallucination, positioning GHOST as both a diagnostic and corrective tool for building more reliable multimodal systems.


Key Contributions

  • GHOST: an automatic embedding-space optimization method that generates natural-looking adversarial images causing MLLMs to hallucinate absent objects, without human supervision
  • Achieves 28%+ hallucination success rate vs. ~1% for prior data-driven methods, with demonstrated transferability (66.5% cross-model transfer from Qwen2.5-VL to GPT-4o)
  • Shows fine-tuning on GHOST-generated images mitigates hallucination, positioning the attack as both a diagnostic stress-test and a data augmentation defense

🛡️ Threat Analysis

Input Manipulation Attack

GHOST optimizes adversarial images in the image embedding space of VLMs to cause misperception (hallucination of absent objects) at inference time — this is a gradient-based adversarial input manipulation attack on vision-language models, fitting ML01's core definition of crafting inputs that cause incorrect outputs.
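The embedding-space attack described above can be illustrated with a toy sketch. Everything here is a hypothetical stand-in, not the paper's implementation: a linear `object_score` replaces the MLLM's log-probability of emitting the target object, and plain gradient ascent with a proximity penalty replaces backpropagation through the vision-language model and the diffusion decoding stage. The function names, weights, and hyperparameters (`lam`, `lr`) are illustrative assumptions.

```python
# Toy sketch of GHOST-style embedding-space optimization.
# object_score is a hypothetical linear stand-in for the MLLM's
# log P(target object | image embedding); real GHOST differentiates
# through the model and then decodes the embedding with diffusion.

def object_score(emb, w):
    # Stand-in for the model's confidence that the target object is present.
    return sum(e * wi for e, wi in zip(emb, w))

def ghost_step(emb, orig, w, lr=0.1, lam=0.5):
    # One gradient-ascent step on the score, with an L2 proximity penalty
    # lam * ||emb - orig||^2 acting as the "stealth" constraint that keeps
    # the optimized embedding close to the original image's embedding.
    grad = [wi - 2.0 * lam * (e - o) for wi, e, o in zip(w, emb, orig)]
    return [e + lr * g for e, g in zip(emb, grad)]

orig = [0.2, -0.1, 0.4]   # original image embedding (toy values)
w = [1.0, -0.5, 0.3]      # toy "detector" weights (assumption)

emb = list(orig)
for _ in range(200):
    emb = ghost_step(emb, orig, w)

# After optimization, the object score rises while the perturbed
# embedding stays within a bounded distance of the original.
```

In the real attack the optimized embedding conditions a diffusion model, so the final artifact is a natural-looking image rather than a raw embedding; the two-term objective (mislead the model, stay close to the original) is the part this sketch captures.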


Details

Domains
vision, nlp, multimodal, generative
Model Types
vlm, diffusion, llm
Threat Tags
white_box, black_box, inference_time, targeted, digital
Datasets
Evaluated on GPT-4o, Qwen2.5-VL, GLM-4.1V-Thinking outputs; standard MLLM hallucination benchmarks implied
Applications
multimodal llms, visual question answering, object hallucination benchmarking