attack 2026

Into the Gray Zone: Domain Contexts Can Blur LLM Safety Boundaries

Ki Sen Hung ¹, Xi Yang ¹, Chang Liu ², Haoran Li ¹, Kejiang Chen ², Changxuan Fan ¹, Tsun On Kwok ¹, Weiming Zhang ², Xiaomeng Li ¹, Yangqiu Song ¹

¹ The Hong Kong University of Science and Technology

² University of Science and Technology of China

Published on arXiv

2604.15717

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves 93%+ average attack success rate across seven frontier models (93% on GPT-5.2, 100% on Claude-Opus-4.5), substantially outperforming existing jailbreak methods

Novel technique introduced

Jargon

A central goal of LLM alignment is to balance helpfulness with harmlessness, yet these objectives conflict when the same knowledge serves both legitimate and malicious purposes. This tension is amplified by context-sensitive alignment: we observe that domain-specific contexts (e.g., chemistry) selectively relax defenses for domain-relevant harmful knowledge, while safety-research contexts (e.g., jailbreak studies) trigger broader relaxation spanning all harm categories. To systematically exploit this vulnerability, we propose Jargon, a framework combining safety-research contexts with multi-turn adversarial interactions that achieves attack success rates exceeding 93% across seven frontier models, including GPT-5.2, Claude-4.5, and Gemini-3, substantially outperforming existing methods. Activation space analysis reveals that Jargon queries occupy an intermediate region between benign and harmful inputs, a gray zone where refusal decisions become unreliable. To mitigate this vulnerability, we design a policy-guided safeguard that steers models toward helpful yet harmless responses, and internalize this capability through alignment fine-tuning, reducing attack success rates while preserving helpfulness.
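The "gray zone" description above can be made concrete with a small sketch. This summary does not say how the paper's activation space analysis was actually performed, so the following is only one plausible way to measure "intermediate between benign and harmful": project a query's activation vector onto the line joining the benign and harmful centroids. The function names and the two-dimensional toy vectors are hypothetical and stand in for real hidden-state activations.

```python
from statistics import fmean


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [fmean(component) for component in zip(*vectors)]


def gray_zone_position(query, benign, harmful):
    """Project `query` onto the benign -> harmful centroid axis.

    Returns 0.0 at the benign centroid and 1.0 at the harmful centroid;
    values in between indicate an intermediate position along that axis.
    This is an illustrative assumption, not the paper's stated method.
    """
    b = centroid(benign)
    h = centroid(harmful)
    axis = [h_i - b_i for h_i, b_i in zip(h, b)]
    norm_sq = sum(a * a for a in axis)
    if norm_sq == 0.0:
        raise ValueError("benign and harmful centroids coincide")
    offset = [q_i - b_i for q_i, b_i in zip(query, b)]
    return sum(o * a for o, a in zip(offset, axis)) / norm_sq


# Toy stand-ins for activation vectors (made up, not real model data).
benign = [[0.0, 0.0], [0.0, 2.0]]
harmful = [[4.0, 0.0], [4.0, 2.0]]
print(gray_zone_position([2.0, 1.5], benign, harmful))  # 0.5
```

Under this assumption, a score near 0.5 would correspond to a query sitting roughly midway between the two clusters, which is the kind of region the abstract describes as unreliable for refusal decisions.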

Key Contributions

Discovers that safety-research contexts trigger broader defense relaxation than domain-specific contexts across LLMs
Proposes Jargon, a framework combining safety-research contexts with multi-turn adversarial interactions, achieving attack success rates above 93% across seven frontier models, including GPT-5.2, Claude-4.5, and Gemini-3
Shows through activation space analysis that Jargon queries occupy a 'gray zone' between benign and harmful inputs, where refusal decisions become unreliable
Develops a policy-guided safeguard and an alignment fine-tuning defense that reduce attack success rates while preserving helpfulness
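The headline figure is an average attack success rate over several models. As a reading aid, here is a minimal sketch of how such a figure is commonly computed: a per-model rate (judged successes divided by attempts), then an unweighted mean across models. Whether the paper averages this way is an assumption, and the model names and judgement lists below are made-up placeholders, not results from the paper.

```python
def attack_success_rate(judgements):
    """Fraction of attempts that a judge marked as successful (True)."""
    if not judgements:
        raise ValueError("no judgements to score")
    return sum(judgements) / len(judgements)


def macro_average_asr(per_model):
    """Per-model rates plus their unweighted mean across models."""
    if not per_model:
        raise ValueError("no models to average")
    rates = {name: attack_success_rate(j) for name, j in per_model.items()}
    return rates, sum(rates.values()) / len(rates)


# Placeholder judgements for two hypothetical models.
rates, average = macro_average_asr({
    "model_a": [True, True, False, True],
    "model_b": [True, True],
})
print(rates)    # {'model_a': 0.75, 'model_b': 1.0}
print(average)  # 0.875
```

An unweighted mean gives every model equal influence regardless of how many prompts it was tested on, which is why a single model at 100% can sit alongside one at 93% in the same average.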

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

black_box, inference_time, targeted

Applications

conversational ai, chatbot safety, llm alignment

Similar Papers

Transferable Direct Prompt Injection via Activation-Guided MCMC Sampling

Involuntary In-Context Learning: Exploiting Few-Shot Pattern Completion to Bypass Safety Alignment in GPT-5.4

MUSE: MCTS-Driven Red Teaming Framework for Enhanced Multi-Turn Dialogue Safety in Large Language Models

Can You Trick the Grader? Adversarial Persuasion of LLM Judges

In-Context Representation Hijacking

Path Drift in Large Reasoning Models: How First-Person Commitments Override Safety

SpatialJB: How Text Distribution Art Becomes the "Jailbreak Key" for LLM Guardrails

Turning Logic Against Itself: Probing Model Defenses Through Contrastive Questions