Style Attack Disguise: When Fonts Become a Camouflage for Adversarial Intent

Yangshijie Zhang 1, Xinda Wang 2, Jialin Liu 2, Wenqiang Wang 3, Zhicong Ma 1, Xingxing Jia 1

0 citations · 38 references · arXiv

Published on arXiv · 2510.19641

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

SAD achieves strong adversarial attack performance across traditional NLP models, LLMs, and commercial services while preserving human readability through stylistic font substitution.

SAD (Style Attack Disguise)

Novel technique introduced


With social media growth, users employ stylistic fonts and font-like emoji to express individuality, creating visually appealing text that remains human-readable. However, these fonts introduce hidden vulnerabilities in NLP models: while humans easily read stylistic text, models process these characters as distinct tokens, causing interference. We identify this human-model perception gap and propose a style-based attack, Style Attack Disguise (SAD). We design two sizes: light for query efficiency and strong for superior attack performance. Experiments on sentiment classification and machine translation across traditional models, LLMs, and commercial services demonstrate SAD's strong attack performance. We also show SAD's potential threats to multimodal tasks including text-to-image and text-to-speech generation.
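The core substitution can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: it maps ASCII letters onto Unicode "Mathematical Bold" codepoints (one of several stylistic ranges the paper mentions), which render as ordinary bold letters to a human but are entirely different codepoints to a tokenizer.

```python
# Illustrative sketch of the style-substitution idea (not the paper's code):
# map ASCII letters onto Unicode Mathematical Bold codepoints. Humans read
# the result as ordinary text; tokenizers see unfamiliar characters.

def to_math_bold(text: str) -> str:
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(0x1D400 + ord(ch) - ord("A")))  # U+1D400 = bold 'A'
        elif "a" <= ch <= "z":
            out.append(chr(0x1D41A + ord(ch) - ord("a")))  # U+1D41A = bold 'a'
        else:
            out.append(ch)  # leave digits, punctuation, spaces untouched
    return "".join(out)

styled = to_math_bold("great movie")
# Renders like bold "great movie" to a reader, yet shares no codepoints
# with the plain string, so models tokenize it very differently.
```

Other stylistic ranges (sans-serif, monospace, squared letters, regional indicators) work the same way with different base codepoints.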


Key Contributions

  • Introduces SAD, a style-level adversarial attack exploiting Unicode stylistic fonts (mathematical alphabets, regional indicator symbols, squared letters) to create human-readable but model-confusing text
  • Develops a hybrid word ranking method combining Attention-based Importance Score (AIS) and Tokenization Instability Score (TIS) to prioritize attack targets
  • Demonstrates effectiveness across WordPiece, BPE, and LLM architectures on sentiment classification, machine translation, text-to-image, and text-to-speech tasks
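The intuition behind the tokenization-instability side of the ranking can be shown with a toy score. The paper's exact AIS/TIS formulas are not reproduced here; this hypothetical stand-in just measures how much a word fragments when styled, using a mock tokenizer that knows plain ASCII words and falls back to per-byte tokens for unseen characters, as byte-level BPE tokenizers do.

```python
# Hypothetical toy version of a tokenization-instability score (the paper's
# TIS definition is not reproduced here). Words whose styled variants
# fragment into many more tokens are attractive attack targets.

def toy_token_count(word: str, vocab: set) -> int:
    if word in vocab:
        return 1  # known word: a single token
    # unseen characters: byte-level fallback, one token per UTF-8 byte
    return len(word.encode("utf-8"))

def instability(plain: str, styled: str, vocab: set) -> int:
    return toy_token_count(styled, vocab) - toy_token_count(plain, vocab)

vocab = {"great", "movie"}
styled = "".join(chr(0x1D41A + ord(c) - ord("a")) for c in "great")
# "great" is 1 token plain, but its mathematical-bold form is 5 astral
# codepoints at 4 UTF-8 bytes each, so the styled form explodes to 20
# byte-level tokens: a large instability signal.
score = instability("great", styled, vocab)
```

In the paper, this kind of signal is combined with an attention-based importance score to decide which words to substitute first.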

🛡️ Threat Analysis

Input Manipulation Attack

SAD crafts adversarial text inputs by substituting standard characters with Unicode stylistic font equivalents, causing misclassification and degraded outputs at inference time across sentiment classifiers, MT models, and LLMs — a classic input manipulation/evasion attack exploiting the human-model perception gap in tokenization.
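One common mitigation for this class of attack (a general homoglyph defense, not something the paper prescribes) is Unicode compatibility normalization: NFKC folds mathematical-alphanumeric codepoints back to plain ASCII, so a large difference between raw and normalized input is a red flag. Note it is not a complete defense, since some styled ranges, such as regional indicator symbols, have no compatibility decomposition.

```python
import unicodedata

# Generic input-sanitization check for homoglyph-style perturbations
# (a standard mitigation, not taken from the paper): NFKC normalization
# folds Unicode mathematical-alphanumeric letters back to ASCII.

def nfkc_fold(text: str) -> str:
    return unicodedata.normalize("NFKC", text)

styled = "".join(chr(0x1D41A + ord(c) - ord("a")) for c in "great")
assert nfkc_fold(styled) == "great"  # bold letters fold back to ASCII

# Caveat: regional indicator symbols survive NFKC unchanged, so
# normalization alone does not cover every stylistic range.
assert nfkc_fold("\U0001F1E6") == "\U0001F1E6"
```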


Details

Domains
nlp · multimodal
Model Types
llm · transformer · traditional_ml
Threat Tags
black_box · inference_time
Applications
sentiment classification · machine translation · text-to-image generation · text-to-speech generation