attack 2025

Jailbreak Mimicry: Automated Discovery of Narrative-Based Jailbreaks for Large Language Models

Pavlos Ntais

1 citations · 10 references · arXiv

α

Published on arXiv

2510.22085

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

LoRA-fine-tuned Mistral-7B achieves 81.0% ASR against GPT-OSS-20B and 66.5% against GPT-4, a 54x improvement over direct prompting (1.5% baseline), with highest vulnerability in technical/cybersecurity domains (93% ASR).

Jailbreak Mimicry

Novel technique introduced


Large language models (LLMs) remain vulnerable to sophisticated prompt engineering attacks that exploit contextual framing to bypass safety mechanisms, posing significant risks in cybersecurity applications. We introduce Jailbreak Mimicry, a systematic methodology for training compact attacker models to automatically generate narrative-based jailbreak prompts in a one-shot manner. Our approach transforms adversarial prompt discovery from manual craftsmanship into a reproducible scientific process, enabling proactive vulnerability assessment in AI-driven security systems. Developed for the OpenAI GPT-OSS-20B Red-Teaming Challenge, we use parameter-efficient fine-tuning (LoRA) on Mistral-7B with a curated dataset derived from AdvBench, achieving an 81.0% Attack Success Rate (ASR) against GPT-OSS-20B on a held-out test set of 200 items. Cross-model evaluation reveals significant variation in vulnerability patterns: our attacks achieve 66.5% ASR against GPT-4, 79.5% on Llama-3 and 33.0% against Gemini 2.5 Flash, demonstrating both broad applicability and model-specific defensive strengths in cybersecurity contexts. This represents a 54x improvement over direct prompting (1.5% ASR) and demonstrates systematic vulnerabilities in current safety alignment approaches. Our analysis reveals that technical domains (Cybersecurity: 93% ASR) and deception-based attacks (Fraud: 87.8% ASR) are particularly vulnerable, highlighting threats to AI-integrated threat detection, malware analysis, and secure systems, while physical harm categories show greater resistance (55.6% ASR). We employ automated harmfulness evaluation using Claude Sonnet 4, cross-validated with human expert assessment, ensuring reliable and scalable evaluation for cybersecurity red-teaming. Finally, we analyze failure mechanisms and discuss defensive strategies to mitigate these vulnerabilities in AI for cybersecurity.


Key Contributions

  • Jailbreak Mimicry: automated one-shot narrative jailbreak generation via LoRA fine-tuning of Mistral-7B on AdvBench-derived data, achieving 81% ASR against GPT-OSS-20B (54x over direct prompting baseline)
  • Cross-model vulnerability analysis across four major LLM families revealing domain-specific patterns (Cybersecurity: 93% ASR, physical harm: 55.6% ASR)
  • Hybrid evaluation framework combining automated Claude Sonnet harmfulness scoring with human expert cross-validation for scalable red-teaming

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_boxinference_timetargeted
Datasets
AdvBench
Applications
llm safety alignmentai red-teamingcybersecurity ai systems