A Fragile Guardrail: Diffusion LLM's Safety Blessing and Its Failure Mode

Zeyuan He 1,2, Yupeng Chen 1,2, Lang Lin 3, Yihan Wang 2, Shenxu Chang 1, Eric Sommerlade 4, Philip Torr 1, Junchi Yu 1, Adel Bibi 1, Jialin Yu 1,4

0 citations · 38 references · arXiv

Published on arXiv · 2602.00388

Prompt Injection

OWASP LLM Top 10: LLM01

Key Finding

Context nesting achieves state-of-the-art jailbreak success rates across diffusion LLMs and enables the first successful jailbreak of Gemini Diffusion, exposing a critical vulnerability in commercial D-LLMs.

Context Nesting

Novel technique introduced


Diffusion large language models (D-LLMs) offer an alternative to autoregressive LLMs (AR-LLMs) and have demonstrated advantages in generation efficiency. Beyond the utility benefits, we argue that D-LLMs exhibit a previously underexplored safety blessing: their diffusion-style generation confers intrinsic robustness against jailbreak attacks originally designed for AR-LLMs. In this work, we provide an initial analysis of the underlying mechanism, showing that the diffusion trajectory induces a stepwise reduction effect that progressively suppresses unsafe generations. This robustness, however, is not absolute. We identify a simple yet effective failure mode, termed context nesting, where harmful requests are embedded within structured benign contexts, effectively bypassing the stepwise reduction mechanism. Empirically, we show that this simple strategy is sufficient to bypass D-LLMs' safety blessing, achieving state-of-the-art attack success rates across models and benchmarks. Most notably, it enables, to our knowledge, the first successful jailbreak of Gemini Diffusion, exposing a critical vulnerability in commercial D-LLMs. Together, our results characterize both the origins and the limits of D-LLMs' safety blessing, constituting an early-stage red-teaming of D-LLMs.
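
The 'stepwise reduction effect' admits a simple back-of-the-envelope reading. As a toy model (our illustration, not the paper's analysis), suppose each denoising step independently gives the model a chance to suppress an unsafe token with per-step probability p; the token's chance of surviving the whole trajectory then decays geometrically with the number of steps T:

```python
# Toy model of the stepwise reduction effect (illustrative assumption,
# not the paper's analysis): each of T denoising steps independently
# suppresses an unsafe token with probability p, so its survival
# probability across the trajectory is (1 - p) ** T.
for p in (0.05, 0.10, 0.20):      # assumed per-step suppression rates
    for T in (8, 32, 128):        # number of diffusion/denoising steps
        survival = (1 - p) ** T   # P(unsafe token survives all T steps)
        print(f"p={p:.2f}  T={T:>3}  survival={survival:.4f}")
```

Even a modest p = 0.05 drives survival below 1% by T = 128. Read this way, context nesting works by keeping the effective per-step suppression probability near zero: a benign structured context makes each intermediate state look harmless, so the geometric decay never accumulates.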


Key Contributions

  • Identifies and mechanistically explains a 'safety blessing' in D-LLMs: the diffusion trajectory's stepwise reduction effect progressively suppresses unsafe token generations, conferring intrinsic robustness against AR-LLM jailbreaks (see the decoding sketch after this list).
  • Proposes 'context nesting', a simple prompt-level attack that embeds harmful requests inside structured benign contexts to bypass the stepwise reduction mechanism.
  • Demonstrates SOTA jailbreak success rates on multiple D-LLMs, including the first reported successful jailbreak of commercial Gemini Diffusion.
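
For context on why a diffusion trajectory revisits every token, here is a minimal sketch of MaskGIT/LLaDA-style confidence-based unmasking, the decoding scheme typical of D-LLMs. The `stub_model` predictor is a hypothetical stand-in, not the paper's implementation:

```python
import random

MASK = "<mask>"
VOCAB = ["Sure", "I", "cannot", "help", "with", "that", "request", "."]

def stub_model(tokens):
    """Hypothetical stand-in for a masked-diffusion LM: for each masked
    position, return a (token, confidence) guess. Real D-LLMs such as
    LLaDA predict a full distribution per position."""
    return {i: (random.choice(VOCAB), random.random())
            for i, t in enumerate(tokens) if t == MASK}

def diffusion_decode(length=8, steps=4):
    """MaskGIT/LLaDA-style decoding sketch: start fully masked; at each
    step commit only the highest-confidence predictions and re-mask the
    rest, so uncommitted positions are re-predicted with more context."""
    tokens = [MASK] * length
    per_step = max(1, length // steps)  # positions committed per step
    for _ in range(steps):
        preds = stub_model(tokens)      # predictions for masked slots only
        if not preds:
            break
        ranked = sorted(preds.items(), key=lambda kv: kv[1][1], reverse=True)
        for i, (tok, _conf) in ranked[:per_step]:
            tokens[i] = tok             # commit; the rest stay masked
    return tokens

print(" ".join(diffusion_decode()))
```

Because low-confidence positions are re-masked and re-predicted against an increasingly committed context, the model gets repeated opportunities to filter unsafe continuations; the same loop also hints at the attack surface, since a benign nesting context shapes what the committed context looks like at every step.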

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, inference_time, targeted
Datasets
AdvBench, HarmBench
Applications
llm chatbots, text generation