Defense · 2025

DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models

Zherui Li¹, Zheng Nie², Zhenhong Zhou³, Yufei Guo⁴, Yue Liu², Yitong Zhang⁵, Yu Cheng⁶, Qingsong Wen⁷, Kun Wang³, Jiaheng Zhang²


Published on arXiv: 2509.24296

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

DiffuGuard reduces Attack Success Rate against six diverse jailbreak methods from 47.9% to 14.7% across four dLLMs while preserving model utility and generation efficiency.

DiffuGuard

Novel technique introduced


The rapid advancement of Diffusion Large Language Models (dLLMs) introduces unprecedented vulnerabilities that are fundamentally distinct from Autoregressive LLMs, stemming from their iterative and parallel generation mechanisms. In this paper, we conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics. Experimental results reveal a harmful bias inherent in the standard greedy remasking strategy and identify a critical phenomenon we term Denoising-path Dependence, where the safety of early-stage tokens decisively influences the final output. These findings also indicate that while current decoding strategies constitute a significant vulnerability, dLLMs possess a substantial intrinsic safety potential. To unlock this potential, we propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach: Stochastic Annealing Remasking dynamically introduces controlled randomness to mitigate greedy selection bias, while Block-level Audit and Repair exploits internal model representations for autonomous risk detection and guided correction. Comprehensive experiments on four dLLMs demonstrate DiffuGuard's exceptional effectiveness, reducing Attack Success Rate against six diverse jailbreak methods from 47.9% to 14.7% while preserving model utility and efficiency. Our code is available at: https://github.com/niez233/DiffuGuard.
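For intuition, here is a minimal sketch of the first stage: confidence-based remasking with an annealed stochastic perturbation, in the spirit of the Stochastic Annealing Remasking the abstract describes. The Gumbel-noise form, the linear temperature schedule, and all parameter names are illustrative assumptions rather than the paper's exact formulation; setting `tau0 = 0` recovers the standard greedy strategy the authors identify as biased.

```python
import torch

def stochastic_annealing_remask(confidences, num_to_unmask, step, total_steps,
                                tau0=1.0):
    """Pick which masked positions to commit (unmask) at this denoising step.

    Instead of greedily taking the top-`num_to_unmask` confidences (the
    standard remasking strategy), perturb the log-confidences with Gumbel
    noise whose scale anneals toward zero, so early steps explore and late
    steps behave greedily. `tau0` and the linear schedule are illustrative
    assumptions, not the paper's exact formulation.
    """
    # Anneal the noise temperature linearly from tau0 down to 0.
    tau = tau0 * (1.0 - step / max(total_steps - 1, 1))
    # Gumbel-top-k trick: top-k of log(p) + tau * Gumbel(0, 1) samples
    # positions without replacement proportional to p^(1/tau).
    gumbel = -torch.log(-torch.log(torch.rand_like(confidences)))
    scores = torch.log(confidences.clamp_min(1e-9)) + tau * gumbel
    # Commit the highest-scoring positions; the rest stay masked.
    return torch.topk(scores, k=num_to_unmask).indices
```

Because the perturbation decays to zero over the denoising schedule, later steps converge to the greedy selection, while the early steps, exactly the ones the Denoising-path Dependence finding identifies as safety-critical, retain controlled randomness.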


Key Contributions

  • Identifies 'Denoising-path Dependence' — early-stage token safety decisively shapes final output safety — and documents harmful bias in the standard greedy remasking strategy of dLLMs
  • Proposes Stochastic Annealing Remasking, which dynamically injects controlled randomness to mitigate greedy selection bias during generation
  • Proposes Block-level Audit and Repair, exploiting internal model representations for autonomous harmful-content detection and guided in-generation correction (see the sketch after this list)
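A minimal control-flow sketch of the second stage, assuming a linear-probe-style risk score over hidden states. Here `risk_direction`, `threshold`, and the mean-pooled readout are hypothetical stand-ins for the paper's internal-representation detector and guided correction, which are not reproduced.

```python
import torch

def audit_and_repair_block(block_ids, block_hidden, risk_direction,
                           threshold, mask_token_id):
    """Audit one finished generation block via its internal representations;
    if it scores as risky, re-mask it so a fresh denoising pass repairs it.

    `risk_direction` (a safety probe vector), `threshold`, and mean pooling
    are illustrative assumptions, not the paper's detector.
    """
    # Read out the block: mean-pool its hidden states (block_len, hidden_dim)
    # and project onto a safety-relevant direction (a linear-probe heuristic).
    risk = torch.dot(block_hidden.mean(dim=0), risk_direction).item()
    if risk > threshold:
        # Repair path: reset every token in the block to [MASK] so the model
        # regenerates it, e.g. with the stochastic remasking sketched above.
        return torch.full_like(block_ids, mask_token_id), True
    return block_ids, False  # block passes the audit unchanged
```

Operating at the block level fits dLLM decoding naturally: each block is audited once it is fully denoised, so detection and correction stay inside the generation loop rather than being bolted on as a post-hoc filter.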

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm · transformer · diffusion
Threat Tags
inference_time · black_box
Applications
text generation safety · diffusion language model safety