Text Prompt Injection of Vision Language Models
Published on arXiv (2510.09849)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Text prompt injection into images achieves high success rates against LLaVA-Next-72B with dramatically lower computational cost than gradient-based adversarial attacks
Text Prompt Injection
Novel technique introduced
The widespread application of large vision language models has significantly raised safety concerns. In this project, we investigate text prompt injection, a simple yet effective method for misleading these models. We develop an algorithm for this type of attack and demonstrate its effectiveness and efficiency through experiments. Compared with other attack methods, our approach is particularly effective against large models while requiring far less computation.
Key Contributions
- Systematic text prompt injection algorithm that embeds adversarial text within images to mislead VLMs, requiring no gradient access
- Empirical demonstration that text prompt injection achieves high attack success rates on LLaVA-Next-72B with significantly less GPU compute than gradient-based attacks
- Comprehensive analysis of placement and embedding techniques for injected text prompts within images
🛡️ Threat Analysis
The attack crafts adversarial visual inputs (images with embedded text overlays that function as adversarial patches) to manipulate VLM outputs at inference time, consistent with the dual-tagging rule for adversarial visual inputs that jailbreak VLMs or manipulate their outputs.