Special-Character Adversarial Attacks on Open-Source Language Model
Published on arXiv
2508.14070
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
All seven evaluated open-source LLMs (3.8B–32B parameters) exhibit critical vulnerabilities to character-level attacks, producing successful jailbreaks, incoherent outputs, and unrelated hallucinations across all model sizes.
Special-Character Adversarial Attacks
Novel technique introduced
Large language models (LLMs) have achieved remarkable performance across diverse natural language processing tasks, yet their vulnerability to character-level adversarial manipulations presents significant security challenges for real-world deployments. This paper presents a study of different special character attacks including unicode, homoglyph, structural, and textual encoding attacks aimed at bypassing safety mechanisms. We evaluate seven prominent open-source models ranging from 3.8B to 32B parameters on 4,000+ attack attempts. These experiments reveal critical vulnerabilities across all model sizes, exposing failure modes that include successful jailbreaks, incoherent outputs, and unrelated hallucinations.
Key Contributions
- Taxonomy and systematic evaluation of four character-level attack families (unicode, homoglyph, structural, encoding obfuscation) against LLM safety mechanisms
- Empirical study of 4,000+ attack attempts across seven open-source LLMs ranging from 3.8B to 32B parameters, revealing universal vulnerabilities at all model sizes
- Public release of experimental code, attack datasets, and evaluation protocols to facilitate reproducible research and defense development