
Published on arXiv

2508.20570

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Dyslexify improves robustness to typographic attacks by up to 22.06% on a typographic variant of ImageNet-100 while reducing standard ImageNet-100 accuracy by less than 1%, without requiring any gradient-based fine-tuning.

Dyslexify

Novel technique introduced


Typographic attacks exploit multi-modal systems by injecting text into images, leading to targeted misclassifications, malicious content generation, and even Vision-Language Model jailbreaks. In this work, we analyze how CLIP vision encoders behave under typographic attacks, locating specialized attention heads in the latter half of the model's layers that causally extract and transmit typographic information to the cls token. Building on these insights, we introduce a method to defend CLIP models against typographic attacks by selectively ablating a typographic circuit, consisting of attention heads. Without requiring fine-tuning, our method improves performance by up to 19.6% on a typographic variant of ImageNet-100, while reducing standard ImageNet-100 accuracy by less than 1%. Notably, our training-free approach remains competitive with current state-of-the-art typographic defenses that rely on fine-tuning. Finally, we release a family of dyslexic CLIP models which are significantly more robust against typographic attacks. These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.


Key Contributions

  • Introduces the Typographic Attention Score, a mechanistic-interpretability metric that locates the specialized attention heads in CLIP which causally transmit typographic information to the cls token.
  • Proposes Dyslexify, a gradient-free defense that ablates a typographic circuit of attention heads, improving robustness by up to 22.06% on typographic ImageNet-100 with less than 1% clean accuracy loss.
  • Releases a family of dyslexic CLIP models as drop-in replacements for safety-critical applications, validated on medical (skin lesion) and zero-shot classification tasks.
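The defense's core mechanism is selective ablation of attention heads. The paper's actual scoring and ablation procedure is not reproduced here; below is a minimal NumPy sketch of the general idea, zeroing out chosen heads' contributions before the output projection of a single multi-head self-attention layer (all names, shapes, and the zero-ablation choice are illustrative assumptions, not the authors' code, which may use a different ablation scheme).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mha_with_ablation(x, Wq, Wk, Wv, Wo, n_heads, ablate_heads=()):
    """Multi-head self-attention over tokens x (T, D).

    Heads listed in `ablate_heads` are zeroed before the output
    projection, removing their contribution (including to the cls
    token, conventionally token 0) entirely.
    """
    T, D = x.shape
    d = D // n_heads
    # Project and split into heads: (n_heads, T, d)
    q = (x @ Wq).reshape(T, n_heads, d).transpose(1, 0, 2)
    k = (x @ Wk).reshape(T, n_heads, d).transpose(1, 0, 2)
    v = (x @ Wv).reshape(T, n_heads, d).transpose(1, 0, 2)
    att = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))
    out = att @ v                       # per-head outputs (n_heads, T, d)
    for h in ablate_heads:
        out[h] = 0.0                    # illustrative zero-ablation
    return out.transpose(1, 0, 2).reshape(T, D) @ Wo

# Tiny demo: ablating one head changes the output; ablating all zeroes it.
rng = np.random.default_rng(0)
T, D, H = 5, 8, 2
x = rng.normal(size=(T, D))
W = [rng.normal(size=(D, D)) for _ in range(4)]
full = mha_with_ablation(x, *W, n_heads=H)
ablated = mha_with_ablation(x, *W, n_heads=H, ablate_heads=[1])
```

Because the ablation happens before the output projection, the rest of the layer (and all downstream layers in a full model) simply never sees the ablated heads' signal, which is why no gradient-based fine-tuning is needed.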

🛡️ Threat Analysis

Input Manipulation Attack

Typographic attacks craft adversarial inputs (images with injected text) that cause targeted misclassification at inference time. The paper's primary contribution (Dyslexify) is a defense against this input manipulation attack via mechanistic circuit ablation of attention heads responsible for processing adversarial typographic content.
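Crafting the attack itself requires no model access: the attacker simply renders a misleading class name onto the image before it reaches the encoder. A minimal sketch with Pillow follows; the text string, position, and solid-color image are arbitrary illustrations, not inputs from the paper.

```python
import numpy as np
from PIL import Image, ImageDraw

def add_typographic_text(img, text, xy=(8, 8)):
    """Return a copy of `img` with `text` drawn on it.

    The entire 'attack' is this small overlay; every other pixel
    is left untouched, which is what makes the perturbation cheap
    and physically realizable (e.g., a sticky note on an object).
    """
    attacked = img.copy()
    draw = ImageDraw.Draw(attacked)
    draw.text(xy, text, fill=(255, 255, 255))  # default bitmap font
    return attacked

clean = Image.new("RGB", (224, 224), color=(30, 120, 60))
attacked = add_typographic_text(clean, "taxi")  # misleading label overlay
changed = np.any(np.array(clean) != np.array(attacked), axis=-1)
```

Only a tiny fraction of pixels differ between the clean and attacked images, yet that small text region is exactly what the specialized attention heads route to the cls token, and hence what Dyslexify's ablation removes.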


Details

Domains
vision, multimodal
Model Types
vlm, transformer, multimodal
Threat Tags
inference_time, targeted, digital
Datasets
ImageNet-100, typographic ImageNet-100
Applications
zero-shot image classification, medical imaging, content moderation, vision-language models