
GCG Attack On A Diffusion LLM

Ruben Neyroud, Sam Corley


Published on arXiv

2601.14266

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Prefix perturbation was the most computationally efficient and effective GCG variant. Attacks on the LLaDA-Instruct model were overwhelmingly unsuccessful, and the unmodified LLaDA-Base refused 93.65% of AdvBench prompts, suggesting baseline robustness alongside a non-trivial attack surface.

GCG (Greedy Coordinate Gradient) for diffusion LLMs

Novel technique introduced


While most LLMs are autoregressive, diffusion-based LLMs have recently emerged as an alternative method for generation. Greedy Coordinate Gradient (GCG) attacks have proven effective against autoregressive models, but their applicability to diffusion language models remains largely unexplored. In this work, we present an exploratory study of GCG-style adversarial prompt attacks on LLaDA (Large Language Diffusion with mAsking), an open-source diffusion LLM. We evaluate multiple attack variants, including prefix perturbations and suffix-based adversarial generation, on harmful prompts drawn from the AdvBench dataset. Our study provides initial insights into the robustness and attack surface of diffusion language models and motivates the development of alternative optimization and evaluation strategies for adversarial analysis in this setting.


Key Contributions

  • First exploratory implementation of GCG-style gradient-based attacks on a diffusion LLM (LLaDA), covering prefix perturbation and adversarial suffix generation variants
  • Empirical comparison of three GCG variants (prefix, random suffix, Qwen-seeded suffix) on LLaDA-Base and LLaDA-Instruct using AdvBench harmful prompts
  • Preliminary insights into the attack surface and robustness of diffusion language models, motivating new optimization and evaluation strategies for this model class

🛡️ Threat Analysis

Input Manipulation Attack

GCG is a gradient-based, token-level adversarial optimization attack (suffix or prefix perturbation) targeting a language model at inference time; the OWASP taxonomy explicitly classifies adversarial suffix optimization against LLMs as ML01.
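To make the attack structure concrete, the greedy coordinate loop at the heart of GCG can be sketched as follows. This is a toy illustration, not the paper's implementation: real GCG ranks candidate token swaps using gradients of the adversarial loss with respect to one-hot token embeddings, whereas this sketch evaluates every candidate swap directly against a stand-in loss (feasible only for the tiny hypothetical vocabulary and `TARGET` sequence assumed here).

```python
import random

# Toy sketch of the GCG loop structure (illustrative assumptions only).
# Real GCG scores candidate swaps with gradients w.r.t. one-hot token
# embeddings; here we brute-force all candidates over a tiny vocabulary.

VOCAB = list(range(50))          # hypothetical tiny token vocabulary
TARGET = [7, 3, 7, 1, 9, 4]      # hypothetical adversarial optimum

def loss(tokens):
    """Stand-in for the model's loss on the target completion."""
    return sum((t - g) ** 2 for t, g in zip(tokens, TARGET))

def gcg_step(tokens):
    """One greedy coordinate step: best single-token substitution."""
    best_loss, best_tokens = loss(tokens), tokens
    for pos in range(len(tokens)):
        for cand in VOCAB:
            trial = tokens[:pos] + [cand] + tokens[pos + 1:]
            trial_loss = loss(trial)
            if trial_loss < best_loss:
                best_loss, best_tokens = trial_loss, trial
    return best_loss, best_tokens

def attack(num_steps=10, seed=0):
    """Run greedy coordinate descent from a random suffix/prefix init."""
    rng = random.Random(seed)
    tokens = [rng.choice(VOCAB) for _ in TARGET]
    final_loss = loss(tokens)
    for _ in range(num_steps):
        final_loss, tokens = gcg_step(tokens)
        if final_loss == 0:
            break
    return tokens, final_loss
```

Because each step commits to the single best substitution, each iteration fixes at least the worst coordinate, so the toy loss reaches zero within one pass per token. The paper's prefix and suffix variants differ only in where the optimized tokens are placed relative to the harmful prompt.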


Details

Domains
nlp
Model Types
llm, diffusion
Threat Tags
white_box, inference_time, targeted
Datasets
AdvBench
Applications
llm safety, text generation, jailbreaking