Systematic Scaling Analysis of Jailbreak Attacks in Large Language Models
Xiangwen Wang 1, Ananth Balashankar 2, Varun Chandrasekaran 2
Published on arXiv: 2603.11149
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Prompting-based jailbreak paradigms are the most compute-efficient across model families, and misinformation-related harms are systematically easier to elicit than other harm types regardless of attack paradigm.
Jailbreak Scaling Laws
Novel technique introduced
Large language models remain vulnerable to jailbreak attacks, yet we still lack a systematic understanding of how jailbreak success scales with attacker effort across methods, model families, and harm types. We initiate a scaling-law framework for jailbreaks by treating each attack as a compute-bounded optimization procedure and measuring progress on a shared FLOPs axis. Our systematic evaluation spans four representative jailbreak paradigms — optimization-based attacks, self-refinement prompting, sampling-based selection, and genetic optimization — across multiple model families and scales on a diverse set of harmful goals. We investigate scaling laws that relate attacker budget to attack success score by fitting a simple saturating exponential function to FLOPs–success trajectories, and we derive comparable efficiency summaries from the fitted curves. Empirically, prompting-based paradigms tend to be more compute-efficient than optimization-based methods. To explain this gap, we cast prompt-based updates into an optimization view and show, via a same-state comparison, that prompt-based attacks optimize more effectively in prompt space. We also show that attacks occupy distinct success–stealthiness operating points, with prompting-based methods occupying the high-success, high-stealth region. Finally, we find that vulnerability is strongly goal-dependent: harms involving misinformation are typically easier to elicit than non-misinformation harms.
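The fitting procedure described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the exact parameterization of the "simple saturating exponential" is an assumption (here, success = s_max · (1 − exp(−k · compute))), and the data points are synthetic.

```python
import numpy as np
from scipy.optimize import curve_fit

# Assumed form of the saturating exponential: attack success approaches
# a ceiling s_max as attacker compute grows, at rate k.
def saturating_exp(compute, s_max, k):
    return s_max * (1.0 - np.exp(-k * compute))

# Synthetic FLOPs--success trajectory for one attack paradigm
# (illustrative only; FLOPs rescaled to units of 1e13 for stability).
compute = np.array([0.1, 0.5, 1.0, 5.0, 10.0, 50.0])  # FLOPs / 1e13
success = saturating_exp(compute, 0.8, 2.0)           # noiseless example

# Fit the curve to the trajectory and read off the parameters.
params, _ = curve_fit(saturating_exp, compute, success, p0=[1.0, 1.0])
s_max_hat, k_hat = params

# One possible efficiency summary derived from the fitted curve:
# compute needed to reach half the saturation score.
c_half = np.log(2.0) / k_hat
```

Curves fitted this way put every paradigm on the same compute axis, so summaries such as the saturation ceiling `s_max` or the half-saturation compute `c_half` can be compared directly across attacks and model families.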
Key Contributions
- Proposes a FLOPs-normalized scaling-law framework (saturating exponential fit) to compare jailbreak attack efficiency across four paradigms on a unified compute axis
- Shows that prompt-based attacks are more compute-efficient than optimization-based attacks, and explains this gap by casting prompt updates into an optimization view that reveals better prompt-space coverage
- Identifies that vulnerability is strongly goal-dependent, with misinformation harms being systematically easier to elicit, and maps attacks to distinct success–stealthiness operating points
🛡️ Threat Analysis
One of the four evaluated paradigms is optimization-based attacks (gradient-based adversarial suffix optimization, e.g., GCG), which are canonical ML01 input-manipulation attacks at inference time.