
AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models

Aashray Reddy 1, Andrew Zagula 2, Nicholas Saban 3

5 citations · 65 references · arXiv

Published on arXiv: 2511.02376

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves 95% attack success rate on Llama-3.1-8B within six turns, a 24% improvement over single-turn baselines, with persistent vulnerabilities across GPT-4o mini, Qwen3-235B, and Mistral-7B

AutoAdv

Novel technique introduced


Large Language Models (LLMs) remain vulnerable to jailbreaking attacks where adversarial prompts elicit harmful outputs. Yet most evaluations focus on single-turn interactions while real-world attacks unfold through adaptive multi-turn conversations. We present AutoAdv, a training-free framework for automated multi-turn jailbreaking that achieves an attack success rate of up to 95% on Llama-3.1-8B within six turns, a 24% improvement over single-turn baselines. AutoAdv uniquely combines three adaptive mechanisms: a pattern manager that learns from successful attacks to enhance future prompts, a temperature manager that dynamically adjusts sampling parameters based on failure modes, and a two-phase rewriting strategy that disguises harmful requests and then iteratively refines them. Extensive evaluation across commercial and open-source models (Llama-3.1-8B, GPT-4o mini, Qwen3-235B, Mistral-7B) reveals persistent vulnerabilities in current safety mechanisms, with multi-turn attacks consistently outperforming single-turn approaches. These findings demonstrate that alignment strategies optimized for single-turn interactions fail to maintain robustness across extended conversations, highlighting an urgent need for multi-turn-aware defenses.


Key Contributions

  • Training-free multi-turn jailbreaking framework (AutoAdv) combining a pattern manager, temperature manager, and two-phase rewriting strategy for adaptive adversarial prompting
  • Pattern manager that accumulates successful jailbreak strategies and reuses them to improve future attack turns
  • Empirical demonstration that single-turn alignment strategies fail to generalize to multi-turn conversations, achieving 95% ASR on Llama-3.1-8B and 24% improvement over single-turn baselines
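The three mechanisms above compose into a single adaptive attack loop. The following is a minimal sketch of that loop as described in the abstract; all class and function names, the keyword-based refusal heuristic, and the temperature schedule are illustrative assumptions, not the authors' implementation:

```python
# Hedged sketch of AutoAdv's adaptive multi-turn loop. The refusal check,
# pattern vocabulary, and temperature schedule are assumptions for
# illustration only; the paper's actual components are not reproduced here.

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "as an ai")

def is_refusal(response: str) -> bool:
    """Crude keyword heuristic standing in for a real jailbreak judge."""
    return any(m in response.lower() for m in REFUSAL_MARKERS)

class PatternManager:
    """Accumulates rewriting strategies that previously led to success."""
    def __init__(self):
        self.successful = []

    def record(self, pattern: str):
        self.successful.append(pattern)

    def suggest(self) -> str:
        # Fall back to a default disguise when nothing has worked yet.
        return self.successful[-1] if self.successful else "roleplay_framing"

class TemperatureManager:
    """Raises sampling temperature after each refusal to diversify rewrites."""
    def __init__(self, start=0.7, step=0.15, cap=1.3):
        self.temperature, self.step, self.cap = start, step, cap

    def on_failure(self):
        self.temperature = min(self.temperature + self.step, self.cap)

def autoadv_loop(request, attacker_rewrite, target_model, max_turns=6):
    """Two-phase loop: disguise the request, then iteratively refine it."""
    patterns, temps = PatternManager(), TemperatureManager()
    prompt = attacker_rewrite(request, patterns.suggest(), temps.temperature)
    for turn in range(1, max_turns + 1):
        response = target_model(prompt)
        if not is_refusal(response):
            patterns.record(patterns.suggest())  # learn from the success
            return turn, response
        temps.on_failure()  # adjust sampling based on the failure mode
        prompt = attacker_rewrite(request, patterns.suggest(), temps.temperature)
    return None, None  # attack failed within the turn budget
```

The `attacker_rewrite` and `target_model` callables stand in for the attacker LLM and the victim model; in a real evaluation the judge would be a classifier rather than a keyword match.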

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time, targeted, digital
Datasets
AdvBench, HarmBench
Applications
llm chatbots, conversational ai safety