Published on arXiv

2510.02999

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

UJA achieves over 80% attack success rate against recent safety-aligned LLMs with only 100 optimization iterations, outperforming state-of-the-art gradient-based jailbreaks by over 30%.

UJA (Untargeted Jailbreak Attack)

Novel technique introduced


Existing gradient-based jailbreak attacks on Large Language Models (LLMs) typically optimize adversarial suffixes to align the LLM output with predefined target responses. However, restricting the objective to inducing fixed targets inherently constrains the adversarial search space, limiting overall attack efficacy. Moreover, existing methods typically require numerous optimization iterations to bridge the large gap between the fixed target and the original LLM output, resulting in low attack efficiency. To overcome these limitations, we propose the first gradient-based untargeted jailbreak attack (UJA), which relies on an untargeted objective that maximizes the unsafety probability of the LLM output without enforcing any response pattern. For tractable optimization, we further decompose this objective into two differentiable sub-objectives, one searching for the optimal harmful response and one for the corresponding adversarial prompt, with a theoretical analysis to validate the decomposition. In contrast to existing attacks, UJA's unrestricted objective significantly expands the search space, enabling more flexible and efficient exploration of LLM vulnerabilities. Extensive evaluations show that UJA achieves over 80% attack success rates against recent safety-aligned LLMs with only 100 optimization iterations, outperforming state-of-the-art gradient-based attacks by over 30%.
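The objective described above can be sketched formally as follows. The notation here is an illustration, not the paper's exact formulation: $\pi$ denotes the target LLM, $p$ the harmful prompt, $\delta$ the adversarial suffix, and $J$ a judge model that scores the unsafety of an output.

```latex
% Untargeted objective: maximize the judged unsafety of the LLM's response
\delta^{*} \;=\; \arg\max_{\delta}\; P_{J}\!\bigl(\text{unsafe} \mid \pi(p \oplus \delta)\bigr)

% Decomposition into two differentiable sub-objectives (sketch):
% (1) search for an optimal harmful response r* under the judge
r^{*} \;=\; \arg\max_{r}\; P_{J}(\text{unsafe} \mid r)
% (2) optimize the suffix so the LLM's output approaches r*
\delta^{*} \;=\; \arg\min_{\delta}\; \mathcal{L}\bigl(\pi(p \oplus \delta),\, r^{*}\bigr)
```

Unlike fixed-target attacks, $r^{*}$ is itself searched rather than predefined, which is what expands the adversarial search space.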


Key Contributions

  • First gradient-based untargeted jailbreak attack that maximizes unsafety probability of LLM outputs without enforcing any fixed response patterns, expanding the adversarial search space.
  • Theoretical decomposition of the non-differentiable unsafety-maximization objective into two differentiable sub-objectives for tractable optimization.
  • Achieves >80% attack success rate against safety-aligned LLMs in only 100 iterations, outperforming SOTA gradient-based attacks (e.g., GCG) by over 30%.

🛡️ Threat Analysis

Input Manipulation Attack

UJA uses gradient-based, token-level adversarial suffix optimization against LLMs, the same attack surface as GCG, but replaces the fixed-target objective with an untargeted unsafety-maximization objective. This places it squarely in the adversarial-suffix / input-manipulation attack paradigm.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time, untargeted
Applications
safety-aligned llms, llm safety evaluation