
Guarding the Guardrails: A Taxonomy-Driven Approach to Jailbreak Detection

Olga Sorokoletova, Francesco Giarrusso, Vincenzo Suriani, Daniele Nardi

2 citations · 47 references · arXiv


Published on arXiv: 2510.13893

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Taxonomy-guided prompting improves GPT-5's jailbreak detection performance, and multi-turn attacks that distribute malicious intent across benign-looking turns are the hardest for existing guardrails to detect.


Jailbreaking techniques pose a significant threat to the safety of Large Language Models (LLMs). Existing defenses typically focus on single-turn attacks, lack coverage across languages, and rely on limited taxonomies that either fail to capture the full diversity of attack strategies or emphasize risk categories rather than jailbreaking techniques. To advance the understanding of the effectiveness of jailbreaking techniques, we conducted a structured red-teaming challenge. The outcomes of our experiments are fourfold. First, we developed a comprehensive hierarchical taxonomy of jailbreak strategies that systematically consolidates techniques previously studied in isolation and harmonizes existing, partially overlapping classifications with explicit cross-references to prior categorizations. The taxonomy organizes jailbreak strategies into seven mechanism-oriented families: impersonation, persuasion, privilege escalation, cognitive overload, obfuscation, goal conflict, and data poisoning. Second, we analyzed the data collected from the challenge to examine the prevalence and success rates of different attack types, providing insights into how specific jailbreak strategies exploit model vulnerabilities and induce misalignment. Third, we benchmarked GPT-5 as a judge for jailbreak detection, evaluating the benefits of taxonomy-guided prompting for improving automatic detection. Finally, we compiled a new Italian dataset of 1364 multi-turn adversarial dialogues, annotated with our taxonomy, enabling the study of interactions where adversarial intent emerges gradually and succeeds in bypassing traditional safeguards.
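The abstract names the taxonomy's seven mechanism-oriented families. As a minimal sketch, the hierarchy can be represented as a family-to-strategy lookup; the family names below come from the paper, while the fine-grained sub-strategies are illustrative assumptions, not the paper's actual leaf categories:

```python
# Seven mechanism-oriented families from the paper's taxonomy.
# Sub-strategy names are hypothetical examples for illustration only.
JAILBREAK_TAXONOMY = {
    "impersonation": ["role_play", "persona_adoption"],
    "persuasion": ["appeal_to_authority", "emotional_manipulation"],
    "privilege_escalation": ["developer_mode", "system_prompt_override"],
    "cognitive_overload": ["long_context_flooding", "nested_tasks"],
    "obfuscation": ["encoding", "language_switching"],
    "goal_conflict": ["competing_objectives", "hypothetical_framing"],
    "data_poisoning": ["context_injection", "retrieval_manipulation"],
}

def classify(labels):
    """Map fine-grained strategy labels to their top-level families."""
    reverse = {s: fam for fam, subs in JAILBREAK_TAXONOMY.items() for s in subs}
    return sorted({reverse[lbl] for lbl in labels if lbl in reverse})
```

A structure like this makes annotation and cross-referencing mechanical: an annotated dialogue tagged with leaf strategies can be rolled up to family-level statistics, as in the paper's prevalence analysis.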


Key Contributions

  • A comprehensive hierarchical, mechanism-oriented taxonomy of jailbreak strategies organized into 7 families, consolidating and harmonizing prior fragmented classifications.
  • A new Italian dataset of 1,364 multi-turn adversarial dialogues annotated with the proposed taxonomy, addressing a gap in multilingual and multi-turn jailbreak resources.
  • Empirical benchmarking of GPT-5 as a taxonomy-guided jailbreak judge, evaluating how structured prompting improves automatic detection accuracy.
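The third contribution, taxonomy-guided prompting for a judge model, amounts to embedding the strategy families in the judge's instructions. A minimal sketch of such a prompt builder is below; the prompt wording, output format, and dialogue schema are assumptions for illustration, not the paper's exact setup:

```python
# Hypothetical taxonomy-guided judge prompt, assuming a chat-style
# dialogue of {"role": ..., "content": ...} turns. The family names
# are from the paper; everything else is illustrative.
FAMILIES = [
    "impersonation", "persuasion", "privilege_escalation",
    "cognitive_overload", "obfuscation", "goal_conflict", "data_poisoning",
]

def build_judge_prompt(dialogue_turns):
    """Assemble a detection prompt that enumerates the taxonomy families."""
    taxonomy_block = "\n".join(f"- {fam}" for fam in FAMILIES)
    transcript = "\n".join(
        f"{turn['role'].upper()}: {turn['content']}" for turn in dialogue_turns
    )
    return (
        "You are a safety judge. Decide whether the conversation below "
        "contains a jailbreak attempt. Consider these strategy families:\n"
        f"{taxonomy_block}\n\n"
        "Conversation:\n"
        f"{transcript}\n\n"
        "Answer with JAILBREAK or SAFE, then the matching family, if any."
    )
```

Passing the full multi-turn transcript, rather than a single turn, is what lets a judge catch the distributed-intent attacks the Key Finding highlights as hardest to detect.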

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, black_box, targeted
Datasets
Italian multi-turn adversarial dialogue dataset (1,364 samples, new)
Applications
llm safety systems, jailbreak detection, adversarial prompt detection