GCG Attack On A Diffusion LLM
Published on arXiv
2601.14266
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Prefix perturbation was the most computationally efficient and effective GCG variant. Attacks on the LLaDA-Instruct model were overwhelmingly unsuccessful, and the unmodified LLaDA-Base refused 93.65% of AdvBench prompts, suggesting strong baseline robustness alongside a non-trivial attack surface.
GCG (Greedy Coordinate Gradient) for diffusion LLMs
Novel technique introduced
While most LLMs are autoregressive, diffusion-based LLMs have recently emerged as an alternative method for generation. Greedy Coordinate Gradient (GCG) attacks have proven effective against autoregressive models, but their applicability to diffusion language models remains largely unexplored. In this work, we present an exploratory study of GCG-style adversarial prompt attacks on LLaDA (Large Language Diffusion with mAsking), an open-source diffusion LLM. We evaluate multiple attack variants, including prefix perturbations and suffix-based adversarial generation, on harmful prompts drawn from the AdvBench dataset. Our study provides initial insights into the robustness and attack surface of diffusion language models and motivates the development of alternative optimization and evaluation strategies for adversarial analysis in this setting.
Key Contributions
- First exploratory implementation of GCG-style gradient-based attacks on a diffusion LLM (LLaDA), covering prefix perturbation and adversarial suffix generation variants
- Empirical comparison of three GCG variants (prefix, random suffix, Qwen-seeded suffix) on LLaDA-Base and LLaDA-Instruct using AdvBench harmful prompts
- Preliminary insights into the attack surface and robustness of diffusion language models, motivating new optimization and evaluation strategies for this model class
🛡️ Threat Analysis
GCG is a gradient-based, token-level adversarial optimization attack (suffix or prefix perturbation) targeting a language model at inference time; the taxonomy explicitly classifies adversarial suffix optimization against LLMs as ML01 (Input Manipulation Attack).
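The greedy token-swap loop at the core of GCG can be sketched with a toy surrogate: take the gradient of the loss with respect to the one-hot token matrix, shortlist the top-k promising replacements per position, then greedily keep any sampled swap that lowers the loss. Everything below (the linear surrogate loss, dimensions, variable names) is illustrative, not the paper's actual LLaDA objective:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, L = 50, 8, 5           # toy vocab size, embedding dim, suffix length
E = rng.normal(size=(V, d))  # toy embedding matrix
w = rng.normal(size=d)       # toy "target-direction" weights (stand-in for model loss)

def loss(suffix_ids):
    # surrogate objective: negative mean projection of suffix embeddings onto w
    return -E[suffix_ids].mean(axis=0) @ w

def gcg_step(suffix_ids, k=8, n_cand=16):
    # gradient of the loss w.r.t. the one-hot token matrix X (shape L x V):
    # loss = -(1/L) * sum_i X_i . (E @ w), so dL/dX[i, v] = -(1/L) * (E @ w)[v]
    grad = -np.outer(np.ones(L), E @ w) / L
    # top-k candidate tokens per position: most negative gradient = biggest decrease
    topk = np.argsort(grad, axis=1)[:, :k]
    best_ids, best_loss = suffix_ids, loss(suffix_ids)
    for _ in range(n_cand):
        pos = rng.integers(L)                 # pick a random position to mutate
        cand = suffix_ids.copy()
        cand[pos] = topk[pos, rng.integers(k)]
        cand_loss = loss(cand)
        if cand_loss < best_loss:             # greedy: keep only improving swaps
            best_ids, best_loss = cand, cand_loss
    return best_ids, best_loss

suffix = rng.integers(V, size=L)              # random initial adversarial suffix
init_loss = loss(suffix)
for _ in range(20):
    suffix, cur_loss = gcg_step(suffix)
```

In a real GCG attack the loss is the model's cross-entropy on a target completion and the gradient comes from backpropagation through the LLM; the greedy accept-if-better structure, however, is the same.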