Security Risk of Misalignment between Text and Image in Multi-modal Model
Xiaosen Wang, Zhijin Ge, Shaokang Wang
Published on arXiv (2510.26105)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
PReMA successfully manipulates multi-modal diffusion model outputs to generate NSFW content via adversarial images alone, bypassing existing prompt-focused safety defenses on image inpainting and style transfer tasks.
PReMA (Prompt-Restricted Multi-modal Attack)
Novel technique introduced
Despite the notable advancements and versatility of multi-modal diffusion models, such as text-to-image models, their susceptibility to adversarial inputs remains underexplored. Contrary to expectations, our investigations reveal that the alignment between textual and image modalities in existing diffusion models is inadequate. This misalignment presents significant risks, especially in the generation of inappropriate or Not-Safe-For-Work (NSFW) content. To this end, we propose a novel attack called Prompt-Restricted Multi-modal Attack (PReMA) to manipulate the generated content by modifying the input image in conjunction with any specified prompt, without altering the prompt itself. PReMA is the first attack that manipulates model outputs by solely crafting adversarial images, distinguishing itself from prior methods that primarily generate adversarial prompts to produce NSFW content. Consequently, PReMA poses a novel threat to the integrity of multi-modal diffusion models, particularly in image-editing applications that operate with fixed prompts. Comprehensive evaluations conducted on image inpainting and style transfer tasks across various models confirm the potent efficacy of PReMA.
Key Contributions
- Identifies and exploits a misalignment between text and image modalities in diffusion models that enables content manipulation inconsistent with the input prompt.
- Proposes PReMA, the first attack that generates NSFW content by crafting adversarial images alone — without modifying the prompt — rendering prompt-based defenses ineffective.
- Demonstrates efficacy of PReMA on image inpainting and style transfer tasks across multiple diffusion model variants.
🛡️ Threat Analysis
PReMA creates adversarial perturbations to input images at inference time, causing diffusion models to generate manipulated or NSFW content — a direct gradient-based adversarial input manipulation attack. The adversarial examples are images, not prompts.
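The summary above does not specify PReMA's exact optimization, but the attack class it describes (gradient-based perturbation of the input image within a small budget, prompt held fixed) follows the standard projected-gradient-descent (PGD) pattern. The sketch below illustrates that pattern on a toy differentiable loss; `pgd_attack`, `toy_grad`, and the linear stand-in model are illustrative assumptions, not the paper's actual method or loss.

```python
import numpy as np

def pgd_attack(image, grad_fn, epsilon=8 / 255, alpha=2 / 255, steps=10):
    """Generic PGD sketch: perturb only the image (never the prompt),
    staying within an epsilon-ball of the clean input."""
    adv = image.copy()
    for _ in range(steps):
        g = grad_fn(adv)                # gradient of attacker loss w.r.t. the image
        adv = adv - alpha * np.sign(g)  # signed step toward the attacker's target
        adv = np.clip(adv, image - epsilon, image + epsilon)  # project to epsilon-ball
        adv = np.clip(adv, 0.0, 1.0)    # keep pixels in a valid range
    return adv

# Toy stand-in for a differentiable generation loss: a linear "model"
# whose scalar output the attacker drives toward a target value.
W = np.array([0.5, -0.3, 0.2])
target = 1.0

def toy_grad(x):
    # d/dx of 0.5 * (W.x - target)^2  =  (W.x - target) * W
    return (W @ x - target) * W

clean = np.array([0.2, 0.5, 0.8])
adv = pgd_attack(clean, toy_grad, epsilon=0.1, alpha=0.02, steps=20)
```

In a real instantiation, `grad_fn` would backpropagate through the diffusion model's sampling or denoising objective with the prompt fixed; the key point, matching the threat model above, is that all optimization pressure lands on the image, so prompt filters never see anything anomalous.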