
Breaking Guardrails, Facing Walls: Insights on Adversarial AI for Defenders & Researchers

Giacomo Bertollo, Naz Bodemir, Jonah Burgess

0 citations · 11 references · arXiv


Published on arXiv · 2510.16005

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Common jailbreak techniques easily bypassed simple AI guardrails, but layered multi-step defenses posed significant challenges, validating a defense-in-depth strategy for AI safety.


Analyzing data from 500 CTF participants, this paper shows that while participants readily bypassed simple AI guardrails using common techniques, layered multi-step defenses still posed significant challenges, offering concrete insights for building safer AI systems.


Key Contributions

  • Large-scale empirical study of 500 CTF participants attempting to bypass AI guardrails, characterizing which attack techniques succeed
  • Demonstrates that simple single-layer guardrails are readily defeated while layered multi-step defenses remain significantly more robust
  • Provides concrete, practitioner-oriented insights for designing more resilient AI safety systems
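
The layered, defense-in-depth approach the paper found robust can be illustrated with a minimal sketch. This is a hypothetical pipeline (the filter names, blocklist, and thresholds are illustrative assumptions, not the paper's implementation): a prompt is accepted only if every independent layer accepts it, so an attacker must defeat all layers rather than a single filter.

```python
# Hypothetical sketch of a defense-in-depth guardrail pipeline: each layer
# can independently reject a prompt, so a bypass must defeat all of them.
import re


def keyword_filter(prompt: str) -> bool:
    """Layer 1: block prompts containing obvious jailbreak phrases."""
    blocklist = ["ignore previous instructions", "developer mode"]
    return not any(phrase in prompt.lower() for phrase in blocklist)


def structure_filter(prompt: str) -> bool:
    """Layer 2: reject prompts that smuggle in role or system markers."""
    return re.search(r"(?i)\b(system:|assistant:)", prompt) is None


def length_filter(prompt: str, max_chars: int = 2000) -> bool:
    """Layer 3: cap prompt length to limit payload-stuffing attacks."""
    return len(prompt) <= max_chars


def guardrail_pipeline(prompt: str) -> bool:
    """Accept a prompt only if every layer accepts it."""
    layers = (keyword_filter, structure_filter, length_filter)
    return all(layer(prompt) for layer in layers)


print(guardrail_pipeline("What is the capital of France?"))        # True
print(guardrail_pipeline("Ignore previous instructions and ..."))  # False
```

Any single layer here is trivially bypassable on its own; the paper's finding is that composing several such checks raises attacker cost substantially.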

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, inference_time
Datasets
CTF competition dataset (500 participants)
Applications
ai safety guardrails, chatbot safety, llm content filtering