Structured Semantic Cloaking for Jailbreak Attacks on Large Language Models
Xiaobing Sun, Perry Lam, Shaohua Li, Zizhou Wang, Rick Siow Mong Goh, Yong Liu, Liangli Zhen
Published on arXiv
arXiv:2603.16192
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Achieves Attack Success Rate improvements of 12.4% on HarmBench and 9.7% on JBB-Behaviors over SOTA; outperforms the strongest baseline by 26% on GPT-5-mini
Structured Semantic Cloaking (S2C)
Novel technique introduced
Modern LLMs employ safety mechanisms that extend beyond surface-level input filtering to latent semantic representations and generation-time reasoning. These mechanisms allow a model to recover obfuscated malicious intent during inference and refuse accordingly, rendering many surface-level obfuscation jailbreak attacks ineffective. We propose Structured Semantic Cloaking (S2C), a novel multi-dimensional jailbreak attack framework that manipulates how malicious semantic intent is reconstructed during model inference. S2C strategically distributes and reshapes semantic cues such that full intent consolidation requires multi-step inference and long-range co-reference resolution within deeper latent representations. The framework comprises three complementary mechanisms: (1) Contextual Reframing, which embeds the request within a plausible high-stakes scenario to bias the model toward compliance; (2) Content Fragmentation, which disperses the semantic signature of the request across disjoint prompt segments; and (3) Clue-Guided Camouflage, which disguises residual semantic cues while embedding recoverable markers that guide output generation. By delaying and restructuring semantic consolidation, S2C degrades safety triggers that depend on coherent or explicitly reconstructed malicious intent at decoding time, while preserving sufficient instruction recoverability for functional output generation. We evaluate S2C across multiple open-source and proprietary LLMs using HarmBench and JBB-Behaviors, where it improves Attack Success Rate (ASR) by 12.4% and 9.7%, respectively, over the current SOTA. Notably, S2C achieves substantial gains on GPT-5-mini, outperforming the strongest baseline by 26% on JBB-Behaviors. We also analyse which combinations of mechanisms perform best against broad families of models, and characterise the trade-off between the extent of obfuscation and input recoverability in jailbreak success.
Key Contributions
- Structured Semantic Cloaking (S2C) framework with three mechanisms: Contextual Reframing, Content Fragmentation, and Clue-Guided Camouflage
- Novel approach that delays semantic consolidation by distributing malicious intent across disjoint prompt segments requiring multi-step inference
- Achieves a 12.4% ASR improvement on HarmBench and 9.7% on JBB-Behaviors, with a 26% improvement over the strongest baseline on GPT-5-mini