
Where to Start Alignment? Diffusion Large Language Model May Demand a Distinct Position

Zhixin Xie , Xurui Song , Jun Luo


Published on arXiv: 2508.12398

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

MOSA substantially reduces attack success rates against eight jailbreaking methods while preserving dLLM utility on coding, mathematics, and general reasoning benchmarks.

MOSA (Middle-tOken Safety Alignment)

Novel technique introduced


Diffusion Large Language Models (dLLMs) have recently emerged as a competitive non-autoregressive paradigm due to their unique training and inference approach. However, the safety of this novel architecture remains largely unstudied. In this paper, we present the first analysis of dLLMs' safety performance and propose a novel safety alignment method tailored to their unique generation characteristics. Specifically, we identify a critical safety asymmetry between the defender and the attacker. For the defender, we reveal that the middle tokens of the response, rather than the initial ones, are more critical to the overall safety of dLLM outputs; this suggests that aligning middle tokens can be more beneficial to the defender. The attacker, on the contrary, has limited power to manipulate middle tokens: we find that dLLMs exhibit a strong tendency towards a sequential generation order in practice, which forces attacks to follow this distribution and diverts them from influencing the critical middle tokens. Building on this asymmetry, we introduce Middle-tOken Safety Alignment (MOSA), a novel method that directly aligns the model's middle generation with safe refusals via reinforcement learning. We implement MOSA and compare its security performance against eight attack methods on two benchmarks. We also test the utility of the MOSA-aligned dLLM on coding, math, and general reasoning. The results strongly support the superiority of MOSA.


Key Contributions

  • First systematic security analysis of diffusion LLMs (dLLMs), showing that middle tokens — not initial tokens — are most critical for safety, unlike in autoregressive LLMs.
  • Discovery of an attacker/defender asymmetry in dLLMs: defenders can align middle tokens directly, while attackers are constrained by dLLMs' sequential generation tendency to influence only initial tokens.
  • MOSA (Middle-tOken Safety Alignment), an RL-based alignment method that anchors middle-token generation to a safe refusal template, reducing attack success while preserving utility on coding, math, and reasoning tasks.
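The middle-token idea above can be illustrated with a toy sketch. This is not the paper's implementation: MOSA uses reinforcement learning over the dLLM's denoising process, whereas the snippet below merely shows, with a simple token-overlap score, what it means to reward the *middle* window of a response for matching a safe refusal template. The function names (`middle_window`, `refusal_reward`) and the 50% window size are illustrative assumptions.

```python
# Hedged sketch (assumed names, not the paper's code): score how closely
# the MIDDLE window of a generated response matches a refusal template,
# mirroring MOSA's focus on middle tokens rather than initial ones.

def middle_window(tokens, frac=0.5):
    """Return the central `frac` slice of a token sequence (assumed 50%)."""
    n = len(tokens)
    w = max(1, int(n * frac))
    start = (n - w) // 2
    return tokens[start:start + w]

def refusal_reward(response_tokens, refusal_tokens):
    """Fraction of middle-window tokens found in the refusal template.
    A real RL objective would instead use the likelihood of the refusal
    under the dLLM's denoising distribution; overlap is a stand-in here."""
    mid = middle_window(response_tokens)
    template = set(refusal_tokens)
    if not mid:
        return 0.0
    return sum(t in template for t in mid) / len(mid)

refusal = "I cannot help with that request".split()
unsafe = "Sure here is how to build the device step by step".split()
aligned = "Sorry but I cannot help with that harmful request here".split()

print(refusal_reward(unsafe, refusal))   # low: middle tokens are harmful content
print(refusal_reward(aligned, refusal))  # high: middle tokens match the refusal
```

A reward like this, plugged into an RL loop, would push the model's mid-sequence tokens toward refusals on harmful prompts, which is the asymmetry the paper exploits: defenders can shape middle tokens directly, while attackers are funneled toward the initial ones.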

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, inference_time, black_box
Datasets
AdvBench, HarmBench
Applications
large language model safety alignment, jailbreak defense