Guarding the Guardrails: A Taxonomy-Driven Approach to Jailbreak Detection
Olga Sorokoletova , Francesco Giarrusso , Vincenzo Suriani , Daniele Nardi
Published on arXiv
2510.13893
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Taxonomy-guided prompting improves GPT-5's jailbreak detection performance, and multi-turn attacks that distribute malicious intent across benign-looking turns are the hardest for existing guardrails to detect.
Jailbreaking techniques pose a significant threat to the safety of Large Language Models (LLMs). Existing defenses typically focus on single-turn attacks, lack coverage across languages, and rely on limited taxonomies that either fail to capture the full diversity of attack strategies or emphasize risk categories rather than jailbreaking techniques. To advance the understanding of the effectiveness of jailbreaking techniques, we conducted a structured red-teaming challenge. The outcomes of our experiments are fourfold. First, we developed a comprehensive hierarchical taxonomy of jailbreak strategies that systematically consolidates techniques previously studied in isolation and harmonizes existing, partially overlapping classifications with explicit cross-references to prior categorizations. The taxonomy organizes jailbreak strategies into seven mechanism-oriented families: impersonation, persuasion, privilege escalation, cognitive overload, obfuscation, goal conflict, and data poisoning. Second, we analyzed the data collected from the challenge to examine the prevalence and success rates of different attack types, providing insights into how specific jailbreak strategies exploit model vulnerabilities and induce misalignment. Third, we benchmarked GPT-5 as a judge for jailbreak detection, evaluating the benefits of taxonomy-guided prompting for improving automatic detection. Finally, we compiled a new Italian dataset of 1364 multi-turn adversarial dialogues, annotated with our taxonomy, enabling the study of interactions where adversarial intent emerges gradually and succeeds in bypassing traditional safeguards.
Key Contributions
- A comprehensive hierarchical, mechanism-oriented taxonomy of jailbreak strategies organized into 7 families, consolidating and harmonizing prior fragmented classifications.
- A new Italian dataset of 1,364 multi-turn adversarial dialogues annotated with the proposed taxonomy, addressing a gap in multilingual and multi-turn jailbreak resources.
- Empirical benchmarking of GPT-5 as a taxonomy-guided jailbreak judge, evaluating how structured prompting improves automatic detection accuracy.