benchmark 2025
Breaking Guardrails, Facing Walls: Insights on Adversarial AI for Defenders & Researchers
Giacomo Bertollo , Naz Bodemir , Jonah Burgess
0 citations · 11 references · arXiv
α
Published on arXiv
2510.16005
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Common jailbreak techniques easily bypassed simple AI guardrails, but layered multi-step defenses posed significant challenges, validating a defense-in-depth strategy for AI safety
Analyzing 500 CTF participants, this paper shows that while participants readily bypassed simple AI guardrails using common techniques, layered multi-step defenses still posed significant challenges, offering concrete insights for building safer AI systems.
Key Contributions
- Large-scale empirical study of 500 CTF participants attempting to bypass AI guardrails, characterizing which attack techniques succeed
- Demonstrates that simple single-layer guardrails are readily defeated while layered multi-step defenses remain significantly more robust
- Provides concrete, practitioner-oriented insights for designing more resilient AI safety systems
🛡️ Threat Analysis
Details
Domains
nlp
Model Types
llm
Threat Tags
black_boxinference_time
Datasets
CTF competition dataset (500 participants)
Applications
ai safety guardrailschatbot safetyllm content filtering