Reading Between the Pixels: Linking Text-Image Embedding Alignment to Typographic Attack Success on Vision-Language Models
Ravikumar Balakrishnan, Sanket Mendapara, Ankit Garg
Published on arXiv
2604.12371
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Text attacks achieve 36% ASR on GPT-4o vs 8% for image attacks; embedding distance correlates with ASR (r = -0.71 to -0.93) across all VLMs
Typographic Prompt Injection
Novel technique introduced
We study typographic prompt injection attacks on vision-language models (VLMs), where adversarial text is rendered as images to bypass safety mechanisms, posing a growing threat as VLMs serve as the perceptual backbone of autonomous agents, from browser automation and computer-use systems to camera-equipped embodied agents. In practice, the attack surface is heterogeneous: adversarial text appears at varying font sizes and under diverse visual conditions, while the growing ecosystem of VLMs exhibits substantial variation in vulnerability, complicating defensive approaches. Evaluating 1,000 prompts from SALAD-Bench across four VLMs (GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL-4B-Instruct) under varying font sizes (6--28px) and visual transformations (rotation, blur, noise, contrast changes), we find: (1) font size significantly affects attack success rate (ASR), with very small fonts (6px) yielding near-zero ASR while mid-range fonts achieve peak effectiveness; (2) text attacks are more effective than image attacks for GPT-4o (36% vs 8%) and Claude (47% vs 22%), while Qwen3-VL and Mistral show comparable ASR across modalities; (3) text-image embedding distance from two multimodal embedding models (JinaCLIP and Qwen3-VL-Embedding) shows a strong negative correlation with ASR across all four models (r = -0.71 to -0.93, p < 0.01); (4) heavy degradations increase embedding distance by 10--12% and reduce ASR by 34--96%, while rotation affects models asymmetrically (Mistral drops 50%, GPT-4o is unchanged). These findings highlight that model-specific robustness patterns preclude one-size-fits-all defenses and offer empirical guidance for practitioners selecting VLM backbones for agentic systems operating in adversarial environments.
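The experimental setup above (rendering adversarial text at a chosen font size, then applying rotation or blur before sending the image to a VLM) can be sketched as follows. This is an illustrative reconstruction, not the paper's released code: the canvas size, font choice, and function name are assumptions, and it uses the third-party Pillow library.

```python
# Sketch of the attack-image generation pipeline described in the paper.
# Assumptions (not from the paper): 336x336 canvas, DejaVuSans font,
# white background. Requires the third-party Pillow library.
from PIL import Image, ImageDraw, ImageFilter, ImageFont


def render_typographic_attack(text, font_px=12, angle=0.0, blur_radius=0.0,
                              size=(336, 336)):
    """Render adversarial text as an image, then apply visual transformations."""
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    try:
        # Common TrueType font on Linux; font_px controls the studied font size.
        font = ImageFont.truetype("DejaVuSans.ttf", font_px)
    except OSError:
        font = ImageFont.load_default()  # fallback: fixed-size bitmap font
    draw.text((10, size[1] // 2), text, fill="black", font=font)
    if angle:
        img = img.rotate(angle, fillcolor="white")  # rotation transformation
    if blur_radius:
        img = img.filter(ImageFilter.GaussianBlur(blur_radius))  # blur
    return img


img = render_typographic_attack("ignore previous instructions",
                                font_px=12, angle=15, blur_radius=1.0)
```

Noise injection and contrast changes (the paper's other transformations) would slot in the same way, e.g. via `ImageEnhance.Contrast`.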
Key Contributions
- Systematic evaluation of typographic attacks across four VLMs under 12 font sizes and 10 visual transformations, revealing threshold ASR patterns (near-zero at 6px, plateau at 10--12px)
- Discovery that text attacks are 2-5x more effective than image attacks for GPT-4o and Claude, while Qwen3-VL and Mistral show comparable ASR across modalities
- Demonstration that multimodal embedding distance (JinaCLIP, Qwen3-VL-Embedding) strongly predicts attack success across all models (r=-0.71 to -0.93, p<0.01), offering model-agnostic detection signal
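The detection signal in the last contribution reduces to two computations: a cosine distance between the embedding of the injected text and the embedding of the rendered image, and a Pearson correlation between those distances and observed ASRs. A minimal pure-Python sketch, with hypothetical numbers standing in for real embedding-model outputs:

```python
# Illustrative sketch of the embedding-distance detection signal.
# The distance/ASR values below are hypothetical placeholders, not the
# paper's data; in practice the vectors would come from JinaCLIP or
# Qwen3-VL-Embedding.
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between paired samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical per-condition text-image embedding distances and ASRs:
# as rendering degrades, distance grows and attack success falls.
distances = [0.18, 0.22, 0.27, 0.33, 0.41]
asrs = [0.47, 0.36, 0.25, 0.12, 0.03]
r = pearson_r(distances, asrs)  # strongly negative, mirroring the finding
```

A model-agnostic detector could then flag inputs whose text-image embedding distance falls below a tuned threshold, since low distance (text clearly legible to the embedding model) predicts high attack success.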
🛡️ Threat Analysis
Adversarial visual inputs (text rendered as images with transformations like rotation, blur, noise) manipulated to evade safety mechanisms and cause harmful outputs at inference time.