One Perturbation, Two Failure Modes: Probing VLM Safety via Embedding-Guided Typographic Perturbations
Ravikumar Balakrishnan, Sanket Mendapara
Published on arXiv
2604.25102
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Embedding distance correlates strongly with attack success rate (r=-0.71 to -0.93, p<0.01); adversarial optimization simultaneously recovers readability and reduces safety refusals across GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL
CWA-SSA (Carlini-Wagner Attack with Sign-Stochastic Ascent)
Novel technique introduced
Typographic prompt injection exploits the ability of vision-language models (VLMs) to read text rendered in images, posing a growing threat as VLMs power autonomous agents. Prior work typically focuses on maximizing attack success rate (ASR) but does not explain why certain renderings bypass safety alignment. We make two contributions. First, an empirical study across four VLMs (including GPT-4o and Claude), twelve font sizes, and ten transformations reveals that multimodal embedding distance strongly predicts ASR (r = -0.71 to -0.93, p < 0.01), providing an interpretable, model-agnostic proxy. Since embedding distance predicts ASR, reducing it should improve attack success, but the relationship is mediated by two factors: perceptual readability (whether the VLM can parse the text) and safety alignment (whether it refuses to comply). Second, we use this as a red-teaming tool: we directly maximize image-text embedding similarity under bounded ℓ∞ perturbations via CWA-SSA across four surrogate embedding models, stress-testing both factors without access to the target model. Experiments across five degradation settings on GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL confirm that optimization recovers readability and reduces safety-aligned refusals as two co-occurring effects, with the dominant mechanism depending on the model's safety filter strength and the degree of visual degradation.
Key Contributions
- Empirical finding that multimodal embedding distance predicts attack success rate (r=-0.71 to -0.93) across VLMs
- Embedding-guided adversarial perturbation method (CWA-SSA) that optimizes typographic rendering to bypass safety filters
- Identification of two distinct failure modes: perceptual readability recovery and safety alignment bypass, with dominant mechanism varying by model
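The first contribution rests on a Pearson correlation between embedding distance and ASR. As a rough illustration of how such a coefficient is computed, the sketch below uses only the Python standard library. The function name `pearson_r` and the two data lists are hypothetical placeholders chosen for illustration; they are not values or code from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations (the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical placeholder values, NOT results from the paper:
# one (embedding distance, ASR) pair per rendering condition.
embedding_distance = [0.20, 0.35, 0.50, 0.65, 0.80]
attack_success_rate = [0.90, 0.70, 0.55, 0.30, 0.10]

r = pearson_r(embedding_distance, attack_success_rate)
print(f"r = {r:.2f}")
```

A negative coefficient, as reported in the summary, means that renderings whose image embedding sits farther from the text embedding tend to succeed less often.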
🛡️ Threat Analysis
Uses bounded ℓ∞ adversarial perturbations (CWA-SSA optimization) on visual inputs to manipulate VLM behavior. This is a gradient-based adversarial attack on the vision component of VLMs, optimized on surrogate embedding models without access to the target model.
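To make the idea of a bounded perturbation that raises embedding similarity concrete, here is a minimal sketch of plain projected sign-gradient ascent on a toy problem. It is deliberately not CWA-SSA: a single hand-written linear map `W` stands in for the surrogate encoders, the gradient is derived analytically, and every name and value (`W`, `x0`, `t`, `eps`, `step`) is an assumption made for illustration only.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def embed(W, x):
    """Toy linear 'image encoder' e = W x (a stand-in, not a real surrogate model)."""
    return [dot(row, x) for row in W]


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def cosine_grad_wrt_x(W, x, t):
    """Analytic gradient of cosine(W x, t) with respect to x."""
    e = embed(W, x)
    ne, nt = math.sqrt(dot(e, e)), math.sqrt(dot(t, t))
    c = dot(e, t) / (ne * nt)
    # d cos / d e = t / (|e||t|) - cos * e / |e|^2
    g_e = [ti / (ne * nt) - c * ei / (ne * ne) for ti, ei in zip(t, e)]
    # Chain rule through e = W x gives W^T g_e.
    return [sum(W[i][j] * g_e[i] for i in range(len(W))) for j in range(len(x))]


def linf_sign_ascent(W, x0, t, eps, step, iters):
    """Maximize cosine(W x, t) subject to |x - x0|_inf <= eps and x in [0, 1]."""
    x = list(x0)
    for _ in range(iters):
        g = cosine_grad_wrt_x(W, x, t)
        for j in range(len(x)):
            sgn = (g[j] > 0) - (g[j] < 0)  # sign of the gradient component
            xj = x[j] + step * sgn
            xj = min(max(xj, x0[j] - eps), x0[j] + eps)  # project onto the l_inf ball
            x[j] = min(max(xj, 0.0), 1.0)  # keep pixel values in the valid range
    return x


# Hypothetical toy setup: 3 "pixels", 2-dimensional embedding space.
W = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
x0 = [0.5, 0.5, 0.5]  # clean "image"
t = [1.0, 0.0]        # target "text" embedding
eps = 0.1

x_adv = linf_sign_ascent(W, x0, t, eps=eps, step=0.02, iters=20)
print("cosine before:", cosine(embed(W, x0), t))
print("cosine after: ", cosine(embed(W, x_adv), t))
```

The projection step is what enforces the ℓ∞ bound: however many iterations run, no coordinate moves more than `eps` from its starting value. A real pipeline would replace the linear map with differentiable surrogate encoders and an autodiff framework.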