When Harmless Words Harm: A New Threat to LLM Safety via Conceptual Triggers
Zhaoxin Zhang 1, Borui Chen 2, Yiming Hu 3, Youyang Qu 4, Tianqing Zhu 1, Longxiang Gao 5
4 CSIRO
Published on arXiv
2511.21718
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
MICM consistently achieves higher jailbreak success rates than state-of-the-art baselines across five advanced LLMs including GPT-4o and Deepseek-R1 by bypassing safety filters via abstract conceptual manipulation.
MICM (Morphology Inspired Conceptual Manipulation)
Novel technique introduced
Recent research on large language model (LLM) jailbreaks has primarily focused on techniques that bypass safety mechanisms to elicit overtly harmful outputs. However, such efforts often overlook attacks that exploit the model's capacity for abstract generalization, creating a critical blind spot in current alignment strategies. This gap enables adversaries to induce objectionable content by subtly manipulating the implicit social values embedded in model outputs. In this paper, we introduce MICM, a novel, model-agnostic jailbreak method that targets the aggregate value structure reflected in LLM responses. Drawing on conceptual morphology theory, MICM encodes specific configurations of nuanced concepts into a fixed prompt template through a predefined set of phrases. These phrases act as conceptual triggers, steering model outputs toward a specific value stance without triggering conventional safety filters. We evaluate MICM across five advanced LLMs, including GPT-4o, Deepseek-R1, and Qwen3-8B. Experimental results show that MICM consistently outperforms state-of-the-art jailbreak techniques, achieving high success rates with minimal rejection. Our findings reveal a critical vulnerability in commercial LLMs: their safety mechanisms remain susceptible to covert manipulation of underlying value alignment.
Key Contributions
- Introduces MICM, a model-agnostic jailbreak that exploits conceptual morphology theory to encode extremist ideological configurations into fixed prompt templates via concept-embedded triggers (CETs)
- Reveals a critical blind spot in LLM safety alignment: safety filters focus on explicit harm signals but are vulnerable to covert manipulation of aggregate value orientation
- Empirically demonstrates consistent outperformance over state-of-the-art jailbreaks on five advanced LLMs including GPT-4o, Deepseek-R1, and Qwen3-8B