Reading Between the Pixels: Linking Text-Image Embedding Alignment to Typographic Attack Success on Vision-Language Models
Ravikumar Balakrishnan, Sanket Mendapara, Ankit Garg
Published on arXiv
2604.12371
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Text attacks achieve 36% ASR on GPT-4o vs 8% for image attacks; embedding distance correlates with ASR (r = -0.71 to -0.93) across all VLMs
Typographic Prompt Injection
Novel technique introduced
We study typographic prompt injection attacks on vision-language models (VLMs), where adversarial text is rendered as images to bypass safety mechanisms, posing a growing threat as VLMs serve as the perceptual backbone of autonomous agents, from browser automation and computer-use systems to camera-equipped embodied agents. In practice, the attack surface is heterogeneous: adversarial text appears at varying font sizes and under diverse visual conditions, while the growing ecosystem of VLMs exhibits substantial variation in vulnerability, complicating defensive approaches. Evaluating 1,000 prompts from SALAD-Bench across four VLMs (GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL-4B-Instruct) under varying font sizes (6--28px) and visual transformations (rotation, blur, noise, contrast changes), we find: (1) font size significantly affects attack success rate (ASR), with very small fonts (6px) yielding near-zero ASR while mid-range fonts achieve peak effectiveness; (2) text attacks are more effective than image attacks for GPT-4o (36% vs 8%) and Claude (47% vs 22%), while Qwen3-VL and Mistral show comparable ASR across modalities; (3) text-image embedding distance from two multimodal embedding models (JinaCLIP and Qwen3-VL-Embedding) shows a strong negative correlation with ASR across all four models (r = -0.71 to -0.93, p < 0.01); (4) heavy degradations increase embedding distance by 10--12% and reduce ASR by 34--96%, while rotation affects models asymmetrically (Mistral drops 50%, GPT-4o is unchanged). These findings highlight that model-specific robustness patterns preclude one-size-fits-all defenses and offer empirical guidance for practitioners selecting VLM backbones for agentic systems operating in adversarial environments.
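The experimental setup above (rendering adversarial text at a chosen font size, then applying rotation or blur before sending the image to a VLM) can be sketched as follows. This is an illustrative reconstruction, not the paper's released code: the canvas size, font choice, and function name are assumptions, and it uses the third-party Pillow library.

```python
# Sketch of the attack-image generation pipeline described in the paper.
# Assumptions (not from the paper): 336x336 canvas, DejaVuSans font,
# white background. Requires the third-party Pillow library.
from PIL import Image, ImageDraw, ImageFilter, ImageFont


def render_typographic_attack(text, font_px=12, angle=0.0, blur_radius=0.0,
                              size=(336, 336)):
    """Render adversarial text as an image, then apply visual transformations."""
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    try:
        # Common TrueType font on Linux; font_px controls the studied font size.
        font = ImageFont.truetype("DejaVuSans.ttf", font_px)
    except OSError:
        font = ImageFont.load_default()  # fallback: fixed-size bitmap font
    draw.text((10, size[1] // 2), text, fill="black", font=font)
    if angle:
        img = img.rotate(angle, fillcolor="white")  # rotation transformation
    if blur_radius:
        img = img.filter(ImageFilter.GaussianBlur(blur_radius))  # blur
    return img


img = render_typographic_attack("ignore previous instructions",
                                font_px=12, angle=15, blur_radius=1.0)
```

Noise injection and contrast changes (the paper's other transformations) would slot in the same way, e.g. via `ImageEnhance.Contrast`.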
Key Contributions
- Systematic evaluation of typographic attacks across four VLMs under 12 font sizes and 10 visual transformations, revealing threshold ASR patterns (near-zero at 6px, plateau at 10--12px)
- Discovery that text attacks are 2-5x more effective than image attacks for GPT-4o and Claude, while Qwen3-VL and Mistral show comparable ASR across modalities
- Demonstration that multimodal embedding distance (JinaCLIP, Qwen3-VL-Embedding) strongly predicts attack success across all models (r=-0.71 to -0.93, p<0.01), offering model-agnostic detection signal
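The detection signal in the last contribution reduces to two computations: a cosine distance between the embedding of the injected text and the embedding of the rendered image, and a Pearson correlation between those distances and observed ASRs. A minimal pure-Python sketch, with hypothetical numbers standing in for real embedding-model outputs:

```python
# Illustrative sketch of the embedding-distance detection signal.
# The distance/ASR values below are hypothetical placeholders, not the
# paper's data; in practice the vectors would come from JinaCLIP or
# Qwen3-VL-Embedding.
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between paired samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical per-condition text-image embedding distances and ASRs:
# as rendering degrades, distance grows and attack success falls.
distances = [0.18, 0.22, 0.27, 0.33, 0.41]
asrs = [0.47, 0.36, 0.25, 0.12, 0.03]
r = pearson_r(distances, asrs)  # strongly negative, mirroring the finding
```

A model-agnostic detector could then flag inputs whose text-image embedding distance falls below a tuned threshold, since low distance (text clearly legible to the embedding model) predicts high attack success.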
🛡️ Threat Analysis
Adversarial visual inputs (text rendered as images with transformations like rotation, blur, noise) manipulated to evade safety mechanisms and cause harmful outputs at inference time.