
Scaling Patterns in Adversarial Alignment: Evidence from Multi-LLM Jailbreak Experiments

Samuel Nathanson 1,2, Rebecca Williams 1,2, Cynthia Matuszek 2

1 citation · 30 references · arXiv


Published on arXiv

2511.13788

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Log-scaled attacker-to-target size ratio correlates significantly with jailbreak harm (r=0.51), and attacker refusal rate is a stronger predictor of adversarial outcomes (rho=-0.93) than target model characteristics.


Large language models (LLMs) increasingly operate in multi-agent and safety-critical settings, raising open questions about how their vulnerabilities scale when models interact adversarially. This study examines whether larger models can systematically jailbreak smaller ones, eliciting harmful or restricted behavior despite alignment safeguards. Using standardized adversarial tasks from JailbreakBench, we simulate over 6,000 multi-turn attacker-target exchanges across major LLM families and scales (0.6B–120B parameters), measuring both harm score and refusal behavior as indicators of adversarial potency and alignment integrity. Each interaction is evaluated through aggregated harm and refusal scores assigned by three independent LLM judges, providing a consistent, model-based measure of adversarial outcomes. Aggregating results across prompts, we find a strong and statistically significant correlation between mean harm and the logarithm of the attacker-to-target size ratio (Pearson r = 0.51, p < 0.001; Spearman rho = 0.52, p < 0.001), indicating that relative model size correlates with the likelihood and severity of harmful completions. Mean harm score variance is higher across attackers (0.18) than across targets (0.10), suggesting that attacker-side behavioral diversity contributes more to adversarial outcomes than target susceptibility. Attacker refusal frequency is strongly and negatively correlated with harm (rho = -0.93, p < 0.001), showing that attacker-side alignment mitigates harmful responses. These findings reveal that size asymmetry influences robustness and provide exploratory evidence for adversarial scaling patterns, motivating more controlled investigations into inter-model alignment and safety.
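To make the headline statistic concrete, the sketch below computes a Pearson correlation between the log attacker-to-target size ratio and mean harm score. The data rows are hypothetical stand-ins for illustration only, not the paper's measurements, and the helper is a plain-Python Pearson implementation rather than the study's analysis code.

```python
import math
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical (attacker_params_B, target_params_B, mean_harm) rows;
# the real study aggregates judged harm over 6,000+ exchanges.
rows = [
    (120.0, 0.6, 0.82),
    (70.0, 7.0, 0.55),
    (8.0, 8.0, 0.40),
    (0.6, 120.0, 0.12),
]

# The paper's predictor: log of the attacker-to-target size ratio.
log_ratio = [math.log(a / t) for a, t, _ in rows]
harm = [h for _, _, h in rows]

r = pearson(log_ratio, harm)  # strongly positive on this toy data
```

On the toy rows above the correlation is near 1 by construction; the paper reports a weaker but significant r = 0.51 over the full prompt set.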


Key Contributions

  • Empirical framework simulating 6,000+ multi-turn LLM-to-LLM adversarial exchanges across 0.6B–120B parameter models using JailbreakBench prompts
  • Demonstrates statistically significant correlation (Pearson r=0.51, p<0.001) between log attacker-to-target size ratio and mean harm score
  • Shows attacker refusal frequency is more predictive of adversarial outcomes (rho=-0.93) than target susceptibility, highlighting the role of attacker-side alignment
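The judge-based scoring mentioned above can be sketched minimally as follows. The judge values here are hypothetical; the assumption is that each of the three independent LLM judges returns a harm score in [0, 1] and a boolean refusal flag per exchange, which are then aggregated.

```python
from statistics import mean

# Hypothetical outputs from three independent LLM judges
# for a single attacker-target exchange.
judge_harm = [0.7, 0.6, 0.8]          # harm scores in [0, 1]
judge_refusal = [False, False, True]  # did the judge flag a refusal?

# Aggregate across judges: mean harm and fraction of refusal flags.
harm_score = mean(judge_harm)                           # ≈ 0.7
refusal_rate = sum(judge_refusal) / len(judge_refusal)  # ≈ 0.33
```

Averaging over multiple judges smooths out individual judge bias, which is why the study reports aggregated rather than single-judge scores.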

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, inference_time
Datasets
JailbreakBench
Applications
llm safety evaluation, multi-agent ai systems