GCG Attack On A Diffusion LLM
Published on arXiv
2601.14266
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Prefix perturbation was the most computationally efficient and effective GCG variant. Attacks on the LLaDA-Instruct model were overwhelmingly unsuccessful, and the unmodified LLaDA-Base refused 93.65% of AdvBench prompts, suggesting strong baseline robustness alongside a non-trivial attack surface.
GCG (Greedy Coordinate Gradient) for diffusion LLMs
Novel technique introduced
While most LLMs are autoregressive, diffusion-based LLMs have recently emerged as an alternative method for generation. Greedy Coordinate Gradient (GCG) attacks have proven effective against autoregressive models, but their applicability to diffusion language models remains largely unexplored. In this work, we present an exploratory study of GCG-style adversarial prompt attacks on LLaDA (Large Language Diffusion with mAsking), an open-source diffusion LLM. We evaluate multiple attack variants, including prefix perturbations and suffix-based adversarial generation, on harmful prompts drawn from the AdvBench dataset. Our study provides initial insights into the robustness and attack surface of diffusion language models and motivates the development of alternative optimization and evaluation strategies for adversarial analysis in this setting.
Key Contributions
- First exploratory implementation of GCG-style gradient-based attacks on a diffusion LLM (LLaDA), covering prefix perturbation and adversarial suffix generation variants
- Empirical comparison of three GCG variants (prefix, random suffix, Qwen-seeded suffix) on LLaDA-Base and LLaDA-Instruct using AdvBench harmful prompts
- Preliminary insights into the attack surface and robustness of diffusion language models, motivating new optimization and evaluation strategies for this model class
🛡️ Threat Analysis
GCG is a gradient-based, token-level adversarial optimization attack (suffix or prefix perturbation) targeting a language model at inference time; the taxonomy explicitly classifies adversarial suffix optimization against LLMs as ML01 (Input Manipulation Attack).
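The greedy token-swap loop at the core of GCG can be sketched with a toy surrogate: take the gradient of the loss with respect to the one-hot token matrix, shortlist the top-k promising replacements per position, then greedily keep any sampled swap that lowers the loss. Everything below (the linear surrogate loss, dimensions, variable names) is illustrative, not the paper's actual LLaDA objective:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, L = 50, 8, 5           # toy vocab size, embedding dim, suffix length
E = rng.normal(size=(V, d))  # toy embedding matrix
w = rng.normal(size=d)       # toy "target-direction" weights (stand-in for model loss)

def loss(suffix_ids):
    # surrogate objective: negative mean projection of suffix embeddings onto w
    return -E[suffix_ids].mean(axis=0) @ w

def gcg_step(suffix_ids, k=8, n_cand=16):
    # gradient of the loss w.r.t. the one-hot token matrix X (shape L x V):
    # loss = -(1/L) * sum_i X_i . (E @ w), so dL/dX[i, v] = -(1/L) * (E @ w)[v]
    grad = -np.outer(np.ones(L), E @ w) / L
    # top-k candidate tokens per position: most negative gradient = biggest decrease
    topk = np.argsort(grad, axis=1)[:, :k]
    best_ids, best_loss = suffix_ids, loss(suffix_ids)
    for _ in range(n_cand):
        pos = rng.integers(L)                 # pick a random position to mutate
        cand = suffix_ids.copy()
        cand[pos] = topk[pos, rng.integers(k)]
        cand_loss = loss(cand)
        if cand_loss < best_loss:             # greedy: keep only improving swaps
            best_ids, best_loss = cand, cand_loss
    return best_ids, best_loss

suffix = rng.integers(V, size=L)              # random initial adversarial suffix
init_loss = loss(suffix)
for _ in range(20):
    suffix, cur_loss = gcg_step(suffix)
```

In a real GCG attack the loss is the model's cross-entropy on a target completion and the gradient comes from backpropagation through the LLM; the greedy accept-if-better structure, however, is the same.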