
When Smiley Turns Hostile: Interpreting How Emojis Trigger LLMs' Toxicity

Shiyao Cui 1, Xijia Feng 2, Yingkang Wang 1, Junxiao Yang 1, Zhexin Zhang 1, Biplab Sikdar 2, Hongning Wang 1, Han Qiu 1, Minlie Huang 1


Published on arXiv (arXiv:2509.11141)

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Emoji-substituted prompts achieve nearly 50% higher toxicity generation rate than plain-text counterparts in GPT-4o, with consistent results across 7 LLMs and 5 languages.

Emoji-triggered toxicity jailbreak

Novel technique introduced


Emojis are globally used non-verbal cues in digital communication, and extensive research has examined how large language models (LLMs) understand and utilize emojis across contexts. Although usually associated with friendliness or playfulness, emojis can, as we observe, trigger toxic content generation in LLMs. Motivated by this observation, we investigate: (1) whether emojis can clearly enhance toxicity generation in LLMs and (2) how to interpret this phenomenon. We begin with a comprehensive exploration of emoji-triggered LLM toxicity generation by automating the construction of prompts that use emojis to subtly express toxic intent. Experiments across 5 mainstream languages on 7 widely used LLMs, along with jailbreak tasks, demonstrate that prompts with emojis can easily induce toxicity generation. To understand this phenomenon, we conduct model-level interpretations spanning semantic cognition, sequence generation, and tokenization, suggesting that emojis act as a heterogeneous semantic channel that bypasses safety mechanisms. To pursue deeper insights, we further probe the pre-training corpus and uncover a potential correlation between emoji-related data pollution and the observed toxicity generation behaviors. Supplementary materials provide our implementation code and data. (Warning: this paper contains potentially sensitive content.)
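One intuition behind the tokenization-level interpretation (an illustration of emoji encoding, not the paper's analysis): a single rendered emoji can span several Unicode code points and many UTF-8 bytes, so byte-level tokenizers treat it very differently from an ordinary word:

```python
# Illustrative only: emojis are "heterogeneous" at the encoding level.
# A single rendered glyph may be several code points joined by zero-width
# joiners (ZWJ), which byte-level BPE tokenizers typically fragment into
# multiple tokens, unlike a common English word.
word = "fire"
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # 👨‍👩‍👧 (man+ZWJ+woman+ZWJ+girl)

print(len(word), len(word.encode("utf-8")))      # 4 code points, 4 bytes
print(len(family), len(family.encode("utf-8")))  # 5 code points, 18 bytes
```

The mismatch between what a human reads (one glyph) and what the model's tokenizer sees (a burst of byte-fallback tokens) is one plausible reason safety training generalizes poorly to emoji-laden prompts.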


Key Contributions

  • Systematic demonstration that emoji-substituted prompts can bypass LLM safety mechanisms, evaluated on an emoji version of the AdvBench red-teaming benchmark across 5 languages and 7 LLMs
  • Multi-level mechanistic interpretation showing emojis create a heterogeneous semantic/tokenization channel that reduces LLM sensitivity to harmful intent
  • Empirical correlation between emoji-related data pollution in pre-training corpora and observed toxicity generation behaviors
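The prompt transformation at the heart of the first contribution, replacing intent-bearing words with emojis, can be sketched as follows. This is a minimal illustration with a benign, hypothetical word-to-emoji mapping; the paper's automated construction pipeline is more elaborate and is not reproduced here:

```python
# Minimal sketch of emoji substitution in a prompt (hypothetical mapping,
# shown with benign words only). The paper automates this kind of
# construction so that intent is carried by emojis rather than keywords.
EMOJI_MAP = {
    "fire": "🔥",
    "money": "💰",
    "rocket": "🚀",
}

def emojify(prompt: str, mapping: dict = EMOJI_MAP) -> str:
    """Replace each mapped word (case-insensitive) with its emoji."""
    return " ".join(mapping.get(w.lower(), w) for w in prompt.split())

print(emojify("Light the fire and count the money"))
# → Light the 🔥 and count the 💰
```

Because the surface form no longer contains the flagged keywords, keyword- or embedding-based safety filters tuned on plain text can miss the substituted prompt even though the intent is unchanged.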

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, inference_time, targeted
Datasets
AdvBench
Applications
llm safety alignment, content moderation, chatbot