Structured Semantic Cloaking for Jailbreak Attacks on Large Language Models
Xiaobing Sun, Perry Lam, Shaohua Li, Zizhou Wang, Rick Siow Mong Goh, Yong Liu, Liangli Zhen
Published on arXiv
arXiv:2603.16192
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Achieves Attack Success Rate improvements of 12.4% on HarmBench and 9.7% on JBB-Behaviors over SOTA; outperforms the strongest baseline by 26% on GPT-5-mini
Structured Semantic Cloaking (S2C)
Novel technique introduced
Modern LLMs employ safety mechanisms that extend beyond surface-level input filtering to latent semantic representations and generation-time reasoning. These mechanisms allow a model to recover obfuscated malicious intent during inference and refuse accordingly, rendering many surface-level obfuscation jailbreak attacks ineffective. We propose Structured Semantic Cloaking (S2C), a novel multi-dimensional jailbreak attack framework that manipulates how malicious semantic intent is reconstructed during model inference. S2C strategically distributes and reshapes semantic cues such that full intent consolidation requires multi-step inference and long-range co-reference resolution within deeper latent representations. The framework comprises three complementary mechanisms: (1) Contextual Reframing, which embeds the request within a plausible high-stakes scenario to bias the model toward compliance; (2) Content Fragmentation, which disperses the semantic signature of the request across disjoint prompt segments; and (3) Clue-Guided Camouflage, which disguises residual semantic cues while embedding recoverable markers that guide output generation. By delaying and restructuring semantic consolidation, S2C degrades safety triggers that depend on coherent or explicitly reconstructed malicious intent at decoding time, while preserving sufficient instruction recoverability for functional output generation. We evaluate S2C across multiple open-source and proprietary LLMs using HarmBench and JBB-Behaviors, where it improves Attack Success Rate (ASR) by 12.4% and 9.7%, respectively, over the current SOTA. Notably, S2C achieves substantial gains on GPT-5-mini, outperforming the strongest baseline by 26% on JBB-Behaviors. We also analyse which combinations of mechanisms perform best against broad families of models, and characterise the trade-off between the extent of obfuscation and input recoverability in jailbreak success.
Key Contributions
- Structured Semantic Cloaking (S2C) framework with three mechanisms: Contextual Reframing, Content Fragmentation, and Clue-Guided Camouflage
- Novel approach that delays semantic consolidation by distributing malicious intent across disjoint prompt segments requiring multi-step inference
- Achieves a 12.4% ASR improvement on HarmBench and 9.7% on JBB-Behaviors, with a 26% improvement over the strongest baseline on GPT-5-mini