Style Attack Disguise: When Fonts Become a Camouflage for Adversarial Intent
Yangshijie Zhang 1, Xinda Wang 2, Jialin Liu 2, Wenqiang Wang 3, Zhicong Ma 1, Xingxing Jia 1
Published on arXiv (2510.19641)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
SAD achieves strong adversarial attack performance across traditional NLP models, LLMs, and commercial services while preserving human readability through stylistic font substitution.
SAD (Style Attack Disguise)
Novel technique introduced
With the growth of social media, users employ stylistic fonts and font-like emoji to express individuality, creating visually appealing text that remains human-readable. However, these fonts introduce hidden vulnerabilities in NLP models: while humans read stylistic text with ease, models process the substituted characters as distinct tokens, causing interference. We identify this human-model perception gap and propose a style-based attack, Style Attack Disguise (SAD). We design two variants: a light version for query efficiency and a strong version for superior attack performance. Experiments on sentiment classification and machine translation across traditional models, LLMs, and commercial services demonstrate SAD's strong attack performance. We also show SAD's potential threats to multimodal tasks, including text-to-image and text-to-speech generation.
Key Contributions
- Introduces SAD, a style-level adversarial attack exploiting Unicode stylistic fonts (mathematical alphabets, regional indicator symbols, squared letters) to create human-readable but model-confusing text
- Develops a hybrid word ranking method combining Attention-based Importance Score (AIS) and Tokenization Instability Score (TIS) to prioritize attack targets
- Demonstrates effectiveness across WordPiece, BPE, and LLM architectures on sentiment classification, machine translation, text-to-image, and text-to-speech tasks
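The font-substitution idea behind SAD can be illustrated with the Unicode Mathematical Alphanumeric Symbols block, one of the stylistic alphabets the paper names. A minimal sketch, not the paper's implementation (the function name and the choice of the bold alphabet are ours):

```python
import unicodedata

def to_math_bold(text: str) -> str:
    """Replace ASCII letters with Mathematical Bold look-alikes
    (U+1D400..U+1D433); all other characters pass through unchanged."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(0x1D400 + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(0x1D41A + ord(ch) - ord("a")))
        else:
            out.append(ch)
    return "".join(out)

styled = to_math_bold("great movie")
print(styled)                    # renders as bold "great movie" to a human
print(styled == "great movie")  # False: entirely different code points
# NFKC normalization folds the look-alikes back to ASCII, a candidate defense
print(unicodedata.normalize("NFKC", styled) == "great movie")  # True
```

The human-model gap is visible here: the styled string is visually equivalent to the original, yet no code point matches, so a tokenizer trained on ASCII text sees out-of-distribution input.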
🛡️ Threat Analysis
SAD crafts adversarial text by substituting standard characters with their Unicode stylistic font equivalents, causing misclassification and degraded outputs at inference time across sentiment classifiers, MT models, and LLMs. This is a classic input manipulation (evasion) attack that exploits the human-model perception gap in tokenization.
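One concrete reason tokenizers fragment styled text: the stylistic letters live in Unicode's supplementary planes, so each one costs four UTF-8 bytes instead of one, and byte-level BPE vocabularies rarely contain merges for those sequences. A small illustration (the specific strings are our own example, not from the paper):

```python
plain = "great"
# Mathematical Bold "great": U+1D420 U+1D42B U+1D41E U+1D41A U+1D42D
styled = "\U0001D420\U0001D42B\U0001D41E\U0001D41A\U0001D42D"

# The plain word is five one-byte characters in UTF-8 ...
print(len(plain.encode("utf-8")))   # 5
# ... while every math-bold letter takes four bytes, so a byte-level
# tokenizer sees a 20-byte sequence it has likely never merged.
print(len(styled.encode("utf-8")))  # 20
```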