The Resurgence of GCG Adversarial Attacks on Large Language Models

Gradient-based adversarial prompting, such as the Greedy Coordinate Gradient (GCG) algorithm, has emerged as a powerful method for jailbreaking large language models (LLMs). In this paper, we present a systematic appraisal of GCG and its annealing-augmented variant, T-GCG, across open-source LLMs of varying scales. Using Qwen2.5-0.5B, LLaMA-3.2-1B, and GPT-OSS-20B, we evaluate attack effectiveness on both safety-oriented prompts (AdvBench) and reasoning-intensive coding prompts. Our study reveals three key findings: (1) attack success rates (ASR) decrease with model size, reflecting the increasing complexity and non-convexity of larger models' loss landscapes; (2) prefix-based heuristics substantially overestimate attack effectiveness compared to GPT-4o semantic judgments, which provide a stricter and more realistic evaluation; and (3) coding-related prompts are significantly more vulnerable than adversarial safety prompts, suggesting that reasoning itself can be exploited as an attack vector. In addition, preliminary results with T-GCG show that simulated annealing can diversify adversarial search and achieve competitive ASR under prefix evaluation, though its benefits under semantic judgment remain limited. Together, these findings highlight the scalability limits of GCG, expose overlooked vulnerabilities in reasoning tasks, and motivate further development of annealing-inspired strategies for more robust adversarial evaluation.

Key Contributions

Systematic scaling study of GCG attacks up to a 20B-parameter model (GPT-OSS-20B), showing ASR decreases with model size
Dual evaluation protocol demonstrating that prefix-based heuristics substantially overestimate ASR compared to GPT-4o semantic judgments
T-GCG, a simulated annealing extension of GCG that diversifies adversarial token search and reveals that coding prompts are significantly more vulnerable than safety-oriented prompts

🛡️ Threat Analysis

Input Manipulation Attack

GCG and T-GCG are gradient-based adversarial token-level suffix optimization attacks — exactly the 'adversarial suffix optimization on LLMs' scenario listed under ML01. Token perturbations are computed via gradients, not natural language manipulation.

Details

Domains

nlp

Model Types

llmtransformer

Threat Tags

white_boxinference_timetargeted

Datasets

AdvBench

Applications

2025 34 cit.

Input Manipulation Attack

87%

The Resurgence of GCG Adversarial Attacks on Large Language Models

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

Toward Understanding the Transferability of Adversarial Suffixes in Large Language Models

Systematic Scaling Analysis of Jailbreak Attacks in Large Language Models

Safeguarding Efficacy in Large Language Models: Evaluating Resistance to Human-Written and Algorithmic Adversarial Prompts

Depth Charge: Jailbreak Large Language Models from Deep Safety Attention Heads

H-Node Attack and Defense in Large Language Models

RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against Llm Jailbreaks and Prompt Injections