Special-Character Adversarial Attacks on Open-Source Language Model

Large language models (LLMs) have achieved remarkable performance across diverse natural language processing tasks, yet their vulnerability to character-level adversarial manipulations presents significant security challenges for real-world deployments. This paper presents a study of different special character attacks including unicode, homoglyph, structural, and textual encoding attacks aimed at bypassing safety mechanisms. We evaluate seven prominent open-source models ranging from 3.8B to 32B parameters on 4,000+ attack attempts. These experiments reveal critical vulnerabilities across all model sizes, exposing failure modes that include successful jailbreaks, incoherent outputs, and unrelated hallucinations.

Key Contributions

Taxonomy and systematic evaluation of four character-level attack families (unicode, homoglyph, structural, encoding obfuscation) against LLM safety mechanisms
Empirical study of 4,000+ attack attempts across seven open-source LLMs ranging from 3.8B to 32B parameters, revealing universal vulnerabilities at all model sizes
Public release of experimental code, attack datasets, and evaluation protocols to facilitate reproducible research and defense development