Untargeted Jailbreak Attack
Xinzhe Huang 1,2, Wenjing Hu 3, Tianhang Zheng 1,2, Kedong Xiu 1,2, Xiaojun Jia 4, Di Wang 5, Zhan Qin 1,2, Kui Ren 1,2
2 Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
3 Nanjing University of Science and Technology
Published on arXiv
arXiv:2510.02999
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
UJA achieves over 80% attack success rate against recent safety-aligned LLMs with only 100 optimization iterations, outperforming state-of-the-art gradient-based jailbreaks by over 30%.
UJA (Untargeted Jailbreak Attack)
Novel technique introduced
Existing gradient-based jailbreak attacks on Large Language Models (LLMs) typically optimize adversarial suffixes to align the LLM output with a predefined target response. However, restricting the objective to inducing fixed targets inherently constrains the adversarial search space, limiting overall attack efficacy. Moreover, existing methods typically require numerous optimization iterations to close the large gap between the fixed target and the original LLM output, resulting in low attack efficiency. To overcome these limitations, we propose the first gradient-based untargeted jailbreak attack (UJA), which relies on an untargeted objective that maximizes the unsafety probability of the LLM output without enforcing any response pattern. For tractable optimization, we further decompose this objective into two differentiable sub-objectives — searching for an optimal harmful response and for the corresponding adversarial prompt — with a theoretical analysis validating the decomposition. In contrast to existing attacks, UJA's unrestricted objective significantly expands the search space, enabling more flexible and efficient exploration of LLM vulnerabilities. Extensive evaluations show that UJA achieves over 80% attack success rates against recent safety-aligned LLMs with only 100 optimization iterations, outperforming state-of-the-art gradient-based attacks by over 30%.
Key Contributions
- First gradient-based untargeted jailbreak attack that maximizes unsafety probability of LLM outputs without enforcing any fixed response patterns, expanding the adversarial search space.
- Theoretical decomposition of the non-differentiable unsafety-maximization objective into two differentiable sub-objectives for tractable optimization.
- Achieves >80% attack success rate against safety-aligned LLMs in only 100 iterations, outperforming SOTA gradient-based attacks (e.g., GCG) by over 30%.
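The two-sub-objective decomposition can be sketched with toy differentiable stand-ins. Everything below (the linear "LLM", the logistic "judge", the dimensions, and the learning rates) is an illustrative assumption, not the paper's implementation; the point is only the two-stage structure: first gradient-ascend a response toward maximal judged unsafety, then optimize a prompt so the model reproduces that response.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions, not the paper's models):
# - judge_unsafety(r): differentiable unsafety probability of a response embedding r
# - llm(p):            toy "LLM" mapping a prompt embedding p to a response embedding
W_judge = rng.normal(size=4)                    # judge weights
W_llm = np.eye(4) + 0.1 * rng.normal(size=(4, 4))  # well-conditioned toy LLM

def judge_unsafety(r):
    return 1.0 / (1.0 + np.exp(-W_judge @ r))   # sigmoid(W_judge . r)

def llm(p):
    return W_llm @ p

# Sub-objective 1: search a response r* that maximizes the judge's
# unsafety probability (gradient ascent on the response embedding).
r = np.zeros(4)
for _ in range(200):
    s = judge_unsafety(r)
    r += 0.5 * s * (1 - s) * W_judge            # d sigmoid(W_judge . r) / d r
r_star = r.copy()

# Sub-objective 2: optimize the prompt p so the LLM's output matches r*
# (gradient descent on || llm(p) - r* ||^2).
p = np.zeros(4)
for _ in range(500):
    err = llm(p) - r_star
    p -= 0.05 * (W_llm.T @ err)

score = judge_unsafety(llm(p))                  # unsafety of the induced response
```

Note the contrast with a fixed-target attack: here r* is itself optimized to maximize the judge score, rather than being a hand-written affirmative prefix.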
🛡️ Threat Analysis
UJA uses gradient-based token-level adversarial suffix optimization on LLMs — the same attack surface as GCG — but replaces the fixed-target objective with an untargeted unsafety-maximization objective, directly fitting the adversarial suffix / input manipulation attack paradigm.
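The suffix-optimization paradigm above can be sketched on a toy problem. The vocabulary, embeddings, and scoring function are assumptions for illustration; real GCG-style attacks use gradients through one-hot token embeddings to shortlist candidate swaps rather than the exhaustive search used here, and the score would come from the victim model plus a judge rather than a linear surrogate.

```python
import numpy as np

rng = np.random.default_rng(1)

VOCAB = 50                                  # toy vocabulary size
E = rng.normal(size=(VOCAB, 8))             # toy token embeddings
w = rng.normal(size=8)                      # toy "unsafety" direction

def unsafety(suffix_ids):
    """Toy surrogate score of a discrete suffix: mean embedding . w."""
    return float(E[suffix_ids].mean(axis=0) @ w)

def greedy_coordinate_step(suffix_ids):
    """One GCG-style sweep: try every single-token swap at every position
    and keep the one swap that most increases the score."""
    best = list(suffix_ids)
    best_score = unsafety(best)
    for pos in range(len(suffix_ids)):
        for tok in range(VOCAB):
            cand = list(suffix_ids)
            cand[pos] = tok
            s = unsafety(cand)
            if s > best_score:
                best, best_score = cand, s
    return best, best_score

suffix = [0, 1, 2, 3]                       # initial adversarial suffix tokens
for _ in range(5):
    suffix, score = greedy_coordinate_step(suffix)
```

Because the objective is untargeted (maximize a score) rather than targeted (match a fixed string), each swap is judged only by how unsafe the induced output is, which is what widens the search space relative to fixed-target attacks.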