Published on arXiv

2510.02999

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

UJA achieves over 80% attack success rate against recent safety-aligned LLMs with only 100 optimization iterations, outperforming state-of-the-art gradient-based jailbreaks by over 30%.

UJA (Untargeted Jailbreak Attack)

Novel technique introduced


Existing gradient-based jailbreak attacks on Large Language Models (LLMs) typically optimize adversarial suffixes to align the LLM output with predefined target responses. However, restricting the objective to inducing fixed targets inherently constrains the adversarial search space, limiting overall attack efficacy. Moreover, existing methods typically require numerous optimization iterations to bridge the large gap between the fixed target and the original LLM output, resulting in low attack efficiency. To overcome these limitations, we propose the first gradient-based untargeted jailbreak attack (UJA), which relies on an untargeted objective that maximizes the unsafety probability of the LLM output without enforcing any response pattern. For tractable optimization, we further decompose this objective into two differentiable sub-objectives, one searching for the optimal harmful response and one for the corresponding adversarial prompt, with a theoretical analysis to validate the decomposition. In contrast to existing attacks, UJA's unrestricted objective significantly expands the search space, enabling more flexible and efficient exploration of LLM vulnerabilities. Extensive evaluations show that UJA achieves over 80% attack success rates against recent safety-aligned LLMs with only 100 optimization iterations, outperforming state-of-the-art gradient-based attacks by over 30%.
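The objective described above can be sketched formally as follows. The notation here is an illustration, not the paper's exact formulation: $\pi$ denotes the target LLM, $p$ the harmful prompt, $\delta$ the adversarial suffix, and $J$ a judge model that scores the unsafety of an output.

```latex
% Untargeted objective: maximize the judged unsafety of the LLM's response
\delta^{*} \;=\; \arg\max_{\delta}\; P_{J}\!\bigl(\text{unsafe} \mid \pi(p \oplus \delta)\bigr)

% Decomposition into two differentiable sub-objectives (sketch):
% (1) search for an optimal harmful response r* under the judge
r^{*} \;=\; \arg\max_{r}\; P_{J}(\text{unsafe} \mid r)
% (2) optimize the suffix so the LLM's output approaches r*
\delta^{*} \;=\; \arg\min_{\delta}\; \mathcal{L}\bigl(\pi(p \oplus \delta),\, r^{*}\bigr)
```

Unlike fixed-target attacks, $r^{*}$ is itself searched rather than predefined, which is what expands the adversarial search space.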


Key Contributions

  • First gradient-based untargeted jailbreak attack that maximizes unsafety probability of LLM outputs without enforcing any fixed response patterns, expanding the adversarial search space.
  • Theoretical decomposition of the non-differentiable unsafety-maximization objective into two differentiable sub-objectives for tractable optimization.
  • Achieves >80% attack success rate against safety-aligned LLMs in only 100 iterations, outperforming SOTA gradient-based attacks (e.g., GCG) by over 30%.

🛡️ Threat Analysis

Input Manipulation Attack

UJA uses gradient-based, token-level adversarial suffix optimization against LLMs, the same attack surface as GCG, but replaces the fixed-target objective with an untargeted unsafety-maximization objective. This places it squarely in the adversarial-suffix / input-manipulation attack paradigm.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time, untargeted
Applications
safety-aligned llms, llm safety evaluation