Mask-GCG: Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?
Junjie Mu 1,2, Zonghao Ying 3, Zhekui Fan 4, Zonglei Jing 3, Yaoyuan Zhang 3, Zhengmin Yu 5, Wenxin Zhang 6, Quanchen Zou 2, Xiangzheng Zhang 2
Published on arXiv (2509.06350)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Pruning a minority of low-impact tokens from GCG adversarial suffixes reduces average attack time by 16.8% and perplexity by 24% while fully preserving attack success rate across three model families.
Mask-GCG
Novel technique introduced
Jailbreak attacks on Large Language Models (LLMs) have produced a variety of successful methods by which attackers manipulate models into generating harmful responses they are designed to avoid. Among these, Greedy Coordinate Gradient (GCG) has emerged as a general and effective approach that optimizes the tokens of an adversarial suffix to produce jailbreak prompts. While several improved variants of GCG have been proposed, they all rely on fixed-length suffixes, and the potential redundancy within these suffixes remains unexplored. In this work, we propose Mask-GCG, a plug-and-play method that employs learnable token masking to identify impactful tokens within the suffix. Our approach increases the update probability for tokens at high-impact positions while pruning those at low-impact positions. This pruning not only reduces redundancy but also shrinks the gradient space, thereby lowering computational overhead and shortening the time required to achieve successful attacks compared to GCG. We evaluate Mask-GCG by applying it to the original GCG and several improved variants. Experimental results show that most tokens in the suffix contribute significantly to attack success, and pruning a minority of low-impact tokens affects neither the loss values nor the attack success rate (ASR), thereby revealing token redundancy in LLM prompts. Our findings provide insights for developing efficient and interpretable LLMs from the perspective of jailbreak attacks.
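To make the greedy-coordinate mechanism concrete, here is a minimal toy sketch of one GCG-style step. Everything in it is an assumption for illustration: the "loss" is a surrogate (distance of the mean suffix embedding from a target vector) rather than the LLM log-likelihood the real attack optimizes, and the embeddings, vocabulary size, and function names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, L = 50, 8, 6           # toy vocab size, embedding dim, suffix length
E = rng.normal(size=(V, d))  # stand-in for the model's token embeddings
target = rng.normal(size=d)  # stand-in for the "harmful behavior" direction

def loss(ids):
    # Toy surrogate attack loss: squared distance of the mean suffix
    # embedding from the target (real GCG uses the LLM's loss on a
    # harmful completion).
    return float(np.sum((E[ids].mean(axis=0) - target) ** 2))

def gcg_step(ids, k=8):
    """One greedy-coordinate step: use the gradient w.r.t. each
    position's one-hot token indicator to shortlist k candidate swaps,
    evaluate them exactly, and keep the best."""
    resid = E[ids].mean(axis=0) - target
    # d(loss)/d(one_hot[pos, tok]) = (2/L) * E[tok] @ resid; with this
    # toy mean-pooled loss the gradient is identical at every position.
    grad = 2.0 / len(ids) * (E @ resid)   # shape (V,)
    cand_toks = np.argsort(grad)[:k]      # tokens predicted to lower the loss
    best_ids, best_loss = ids, loss(ids)
    for pos in range(len(ids)):
        for tok in cand_toks:
            trial = ids.copy()
            trial[pos] = tok
            trial_loss = loss(trial)
            if trial_loss < best_loss:
                best_ids, best_loss = trial, trial_loss
    return best_ids, best_loss

suffix = rng.integers(0, V, size=L)
suffix, new_loss = gcg_step(suffix)
```

Mask-GCG plugs into this loop by learning a per-position mask that biases which coordinates get updated and which get pruned entirely.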
Key Contributions
- Proposes Mask-GCG, a plug-and-play learnable token masking mechanism that identifies and prunes redundant tokens from GCG adversarial suffixes without compromising attack success rate.
- Reveals a token importance hierarchy in adversarial suffixes: >83% of tokens are high-impact (semantically rich), while low-impact tokens (punctuation, function words) can be pruned.
- Achieves 7.5% average suffix compression, 16.8% reduction in attack time, and 24% reduction in perplexity while maintaining ASR across GCG, I-GCG, and AmpleGCG variants.
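The pruning side of the mechanism can be sketched as follows. This is a minimal illustration, not the paper's implementation: `mask_scores` stands in for the learnable per-position mask weights (which the real method learns jointly during optimization), and the `keep_ratio` default of 0.925 simply mirrors the reported ~7.5% average suffix compression.

```python
import numpy as np

def prune_suffix(suffix_tokens, mask_scores, keep_ratio=0.925):
    """Drop the lowest-scoring (low-impact) positions from an
    adversarial suffix, preserving the original order of the
    surviving tokens. `mask_scores` are hypothetical stand-ins for
    the learned per-position mask weights."""
    n_keep = max(1, int(round(len(suffix_tokens) * keep_ratio)))
    # Indices of the n_keep highest-impact positions, in original order.
    keep = sorted(np.argsort(mask_scores)[-n_keep:])
    return [suffix_tokens[i] for i in keep]

# Example: with 20 positions and keep_ratio=0.925, the two
# lowest-scoring positions are pruned.
pruned = prune_suffix([f"tok{i}" for i in range(20)], list(range(20)))
```

Because only a minority of positions are low-impact, pruning them shrinks the candidate-token gradient space each step searches over, which is where the reported attack-time reduction comes from.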
🛡️ Threat Analysis
Mask-GCG is a gradient-based adversarial suffix optimization attack: it uses gradient information and learnable masks to identify high-impact token positions within discrete adversarial suffixes, placing it squarely in the adversarial suffix optimization subcategory of ML01.