
Reading Between the Pixels: Linking Text-Image Embedding Alignment to Typographic Attack Success on Vision-Language Models

Ravikumar Balakrishnan, Sanket Mendapara, Ankit Garg


Published on arXiv: 2604.12371

Input Manipulation Attack (OWASP ML Top 10 — ML01)

Prompt Injection (OWASP LLM Top 10 — LLM01)

Key Finding

Text attacks achieve 36% ASR on GPT-4o vs 8% for image attacks; text-image embedding distance correlates with ASR (r = -0.71 to -0.93) across all VLMs

Typographic Prompt Injection (novel technique introduced)


We study typographic prompt injection attacks on vision-language models (VLMs), where adversarial text is rendered as images to bypass safety mechanisms. This poses a growing threat as VLMs serve as the perceptual backbone of autonomous agents, from browser automation and computer-use systems to camera-equipped embodied agents. In practice, the attack surface is heterogeneous: adversarial text appears at varying font sizes and under diverse visual conditions, while the growing ecosystem of VLMs exhibits substantial variation in vulnerability, complicating defensive approaches. Evaluating 1,000 prompts from SALAD-Bench across four VLMs (GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL-4B-Instruct) under varying font sizes (6--28px) and visual transformations (rotation, blur, noise, contrast changes), we find: (1) font size significantly affects attack success rate (ASR), with very small fonts (6px) yielding near-zero ASR while mid-range fonts achieve peak effectiveness; (2) text attacks are more effective than image attacks for GPT-4o (36% vs 8%) and Claude (47% vs 22%), while Qwen3-VL and Mistral show comparable ASR across modalities; (3) text-image embedding distance from two multimodal embedding models (JinaCLIP and Qwen3-VL-Embedding) shows strong negative correlation with ASR across all four models (r = -0.71 to -0.93, p < 0.01); (4) heavy degradations increase embedding distance by 10--12% and reduce ASR by 34--96%, while rotation asymmetrically affects models (Mistral drops 50%, GPT-4o unchanged). These findings highlight that model-specific robustness patterns preclude one-size-fits-all defenses and offer empirical guidance for practitioners selecting VLM backbones for agentic systems operating in adversarial environments.
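The attack setup described above, adversarial text rendered into an image at a chosen font size, can be sketched with Pillow. The canvas size, prompt string, and font fallback below are illustrative assumptions, not the paper's exact harness:

```python
from PIL import Image, ImageDraw, ImageFont

def render_typographic_prompt(text: str, font_px: int) -> Image.Image:
    """Render attack text onto a white canvas at a given font size (sketch)."""
    try:
        # Common scalable font on Linux; any TTF on the system works.
        font = ImageFont.truetype("DejaVuSans.ttf", font_px)
    except OSError:
        # Fallback bitmap font; may ignore font_px on older Pillow versions.
        font = ImageFont.load_default()
    img = Image.new("RGB", (512, 128), "white")
    ImageDraw.Draw(img).text((8, 8), text, fill="black", font=font)
    return img

# Per the paper's findings: 6px text is near-unreadable (near-zero ASR),
# while mid-range sizes (ASR plateau at 10-12px) are most effective.
tiny = render_typographic_prompt("example injected instruction", 6)
peak = render_typographic_prompt("example injected instruction", 12)
```

Sweeping `font_px` over the paper's 6--28px range would reproduce the font-size dimension of the evaluation grid.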


Key Contributions

  • Systematic evaluation of typographic attacks across four VLMs under 12 font sizes and 10 visual transformations, revealing threshold ASR patterns (near-zero at 6px, plateau at 10-12px)
  • Discovery that text attacks are 2-5x more effective than image attacks for GPT-4o and Claude, while Qwen3-VL and Mistral show comparable ASR across modalities
  • Demonstration that multimodal embedding distance (JinaCLIP, Qwen3-VL-Embedding) strongly predicts attack success across all models (r=-0.71 to -0.93, p<0.01), offering model-agnostic detection signal

🛡️ Threat Analysis

Input Manipulation Attack

Adversarial visual inputs (text rendered as images, optionally degraded by transformations such as rotation, blur, and noise) are crafted to evade safety mechanisms and elicit harmful outputs at inference time.
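The transformations named above can be applied as a simple Pillow degradation pipeline; the parameter values are illustrative defaults, not the paper's exact settings, and noise injection is omitted for brevity:

```python
from PIL import Image, ImageEnhance, ImageFilter

def degrade(img: Image.Image, angle: float = 15.0, blur: float = 2.0,
            contrast: float = 0.5) -> Image.Image:
    """Apply rotation, Gaussian blur, and a contrast change (sketch)."""
    out = img.rotate(angle, expand=True, fillcolor="white")  # rotation
    out = out.filter(ImageFilter.GaussianBlur(radius=blur))  # blur
    out = ImageEnhance.Contrast(out).enhance(contrast)       # reduce contrast
    return out

base = Image.new("RGB", (256, 64), "white")
heavy = degrade(base)  # expand=True grows the canvas when rotating
```

Composing such degradations is what the paper reports as increasing embedding distance by 10--12% and cutting ASR by 34--96%.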


Details

Domains: multimodal, vision, NLP
Model Types: VLM, multimodal, transformer
Threat Tags: black-box, inference-time, digital, physical
Datasets: SALAD-Bench
Applications: autonomous agents, browser automation, computer-use systems, embodied agents