attack 2025

When Harmless Words Harm: A New Threat to LLM Safety via Conceptual Triggers

0 citations · 39 references · arXiv

Published on arXiv

2511.21718

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

MICM consistently achieves higher jailbreak success rates than state-of-the-art baselines across five advanced LLMs including GPT-4o and Deepseek-R1 by bypassing safety filters via abstract conceptual manipulation.

MICM (Morphology Inspired Conceptual Manipulation)

Novel technique introduced

Recent research on large language model (LLM) jailbreaks has primarily focused on techniques that bypass safety mechanisms to elicit overtly harmful outputs. However, such efforts often overlook attacks that exploit the model's capacity for abstract generalization, creating a critical blind spot in current alignment strategies. This gap enables adversaries to induce objectionable content by subtly manipulating the implicit social values embedded in model outputs. In this paper, we introduce MICM, a novel, model-agnostic jailbreak method that targets the aggregate value structure reflected in LLM responses. Drawing on conceptual morphology theory, MICM encodes specific configurations of nuanced concepts into a fixed prompt template through a predefined set of phrases. These phrases act as conceptual triggers, steering model outputs toward a specific value stance without triggering conventional safety filters. We evaluate MICM across five advanced LLMs, including GPT-4o, Deepseek-R1, and Qwen3-8B. Experimental results show that MICM consistently outperforms state-of-the-art jailbreak techniques, achieving high success rates with minimal rejection. Our findings reveal a critical vulnerability in commercial LLMs: their safety mechanisms remain susceptible to covert manipulation of underlying value alignment.

Key Contributions

Introduces MICM, a model-agnostic jailbreak that exploits conceptual morphology theory to encode extremist ideological configurations into fixed prompt templates via concept-embedded triggers (CETs)
Reveals a critical blind spot in LLM safety alignment: safety filters focus on explicit harm signals but are vulnerable to covert manipulation of aggregate value orientation
Empirically demonstrates consistent outperformance over state-of-the-art jailbreaks on five advanced LLMs including GPT-4o, Deepseek-R1, and Qwen3-8B

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm

Threat Tags

black_boxinference_timetargeted

Applications

llm chatbotscommercial llms

Read PDF arXiv DOI

When Harmless Words Harm: A New Threat to LLM Safety via Conceptual Triggers

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search

When Reject Turns into Accept: Quantifying the Vulnerability of LLM-Based Scientific Reviewers to Indirect Prompt Injection

Jailbreaking LLMs via Semantically Relevant Nested Scenarios with Targeted Toxic Knowledge

Tree-based Dialogue Reinforced Policy Optimization for Red-Teaming Attacks

MEEA: Mere Exposure Effect-Driven Confrontational Optimization for LLM Jailbreaking

NEXUS: Network Exploration for eXploiting Unsafe Sequences in Multi-Turn LLM Jailbreaks

Uncovering the Vulnerability of Large Language Models in the Financial Domain via Risk Concealment

TrailBlazer: History-Guided Reinforcement Learning for Black-Box LLM Jailbreaking