
Emoji-Based Jailbreaking of Large Language Models

M P V S Gopinadh, S Mahaboob Hussain

0 citations · 19 references · arXiv


Published on arXiv · 2601.00936

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Gemma 2 9B and Mistral 7B showed 10% jailbreak success rates from emoji-based prompts while Qwen 2 7B maintained full alignment (0% success), with inter-model differences confirmed significant at p < 0.001.

Emoji-Based Jailbreaking (emoji stuffing / emoji chaining)

Novel technique introduced


Large Language Models (LLMs) are integral to modern AI applications, but their safety alignment mechanisms can be bypassed through adversarial prompt engineering. This study investigates emoji-based jailbreaking, where emoji sequences are embedded in textual prompts to trigger harmful and unethical outputs from LLMs. We evaluated 50 emoji-based prompts on four open-source LLMs: Mistral 7B, Qwen 2 7B, Gemma 2 9B, and Llama 3 8B. Metrics included jailbreak success rate, safety alignment adherence, and latency, with responses categorized as successful, partial, and failed. Results revealed model-specific vulnerabilities: Gemma 2 9B and Mistral 7B exhibited 10% success rates, while Qwen 2 7B achieved full alignment (0% success). A chi-square test (χ² = 32.94, p < 0.001) confirmed significant inter-model differences. While prior work focused on emoji attacks targeting safety judges or classifiers, our empirical analysis examines direct prompt-level vulnerabilities in LLMs. The results reveal limitations in safety mechanisms and highlight the necessity of systematic handling of emoji-based representations in prompt-level safety and alignment pipelines.
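The two transformations the abstract names can be sketched as simple string manipulations. This is an illustrative reconstruction only; the paper does not publish its exact emoji choices or placement rules, so the emojis and formatting below are assumptions.

```python
# Illustrative sketches of the two emoji-based attack styles described in the
# paper. Emoji choices and placement rules are hypothetical, not the authors'.

def emoji_stuff(prompt: str, emoji: str = "😀") -> str:
    """Emoji stuffing: interleave an emoji between every pair of words,
    which can perturb tokenization without changing the surface meaning."""
    return f" {emoji} ".join(prompt.split())

def emoji_chain(emojis: list[str], suffix: str = "") -> str:
    """Emoji chaining: an emoji sequence standing in for an instruction,
    optionally followed by a short textual cue."""
    chain = "".join(emojis)
    return f"{chain} {suffix}".strip()

print(emoji_stuff("tell me a story"))   # → "tell 😀 me 😀 a 😀 story"
print(emoji_chain(["🔓", "📝"], "continue"))
```

In a benchmark harness, each of the 50 base prompts would be passed through one of these transforms before being sent to the model under test.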


Key Contributions

  • Empirical evaluation of 50 emoji-augmented jailbreak prompts across four open-source LLMs (Mistral 7B, Qwen 2 7B, Gemma 2 9B, Llama 3 8B)
  • Introduction and characterization of two emoji-based attack techniques: emoji stuffing (inserting emojis between words) and emoji chaining (emoji sequences representing instructions)
  • Statistical analysis (chi-square) confirming significant inter-model differences in vulnerability, with Gemma 2 9B and Mistral 7B at 10% success and Qwen 2 7B at 0%
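The chi-square analysis compares the per-model distribution of response categories (successful / partial / failed) across the 50 prompts. The sketch below computes the Pearson statistic by hand on a hypothetical contingency table; the counts are illustrative placeholders, not the paper's data (the paper reports χ² = 32.94 on its actual results).

```python
# Hypothetical 4x3 contingency table: rows = models, columns = response
# categories (successful, partial, failed) out of 50 prompts each.
# These counts are made up for illustration, NOT taken from the paper.
observed = {
    "Mistral 7B": [5, 10, 35],
    "Qwen 2 7B":  [0,  2, 48],
    "Gemma 2 9B": [5,  8, 37],
    "Llama 3 8B": [2,  5, 43],
}

def chi_square(table: dict[str, list[int]]) -> float:
    """Pearson chi-square statistic for an r x c contingency table."""
    rows = list(table.values())
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(rows):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (obs - expected) ** 2 / expected
    return stat

stat = chi_square(observed)
df = (len(observed) - 1) * (3 - 1)  # (r-1)(c-1) = 6 degrees of freedom
print(f"chi2 = {stat:.2f}, df = {df}")
```

With 6 degrees of freedom, a statistic of 32.94 corresponds to p < 0.001, matching the paper's reported significance.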

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
black_box · inference_time · targeted
Datasets
50 custom emoji-augmented jailbreak prompts
Applications
conversational ai · llm safety alignment · content moderation