benchmark 2025

Jailbreaking Attacks vs. Content Safety Filters: How Far Are We in the LLM Safety Arms Race?

Yuan Xin ¹, Dingfan Chen ², Linyi Yang ³, Michael Backes ¹, Xiao Zhang ¹

¹ CISPA Helmholtz Center for Information Security

² Max Planck Institute for Intelligent Systems

³ Southern University of Science and Technology

0 citations · 57 references · arXiv

Published on arXiv

2512.24044

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Nearly all evaluated jailbreak techniques can be detected by at least one content safety filter, suggesting prior attack success-rate claims are inflated due to evaluating models in isolation without deployment-stage filtering.

As large language models (LLMs) are increasingly deployed, ensuring their safe use is paramount. Jailbreaking, adversarial prompts that bypass model alignment to trigger harmful outputs, present significant risks, with existing studies reporting high success rates in evading common LLMs. However, previous evaluations have focused solely on the models, neglecting the full deployment pipeline, which typically incorporates additional safety mechanisms like content moderation filters. To address this gap, we present the first systematic evaluation of jailbreak attacks targeting LLM safety alignment, assessing their success across the full inference pipeline, including both input and output filtering stages. Our findings yield two key insights: first, nearly all evaluated jailbreak techniques can be detected by at least one safety filter, suggesting that prior assessments may have overestimated the practical success of these attacks; second, while safety filters are effective in detection, there remains room to better balance recall and precision to further optimize protection and user experience. We highlight critical gaps and call for further refinement of detection accuracy and usability in LLM safety systems.

Key Contributions

First systematic evaluation of jailbreak attacks against the full LLM deployment pipeline, including both input and output content filtering stages — not just the model itself
Demonstrates that prior evaluations overestimated jailbreak success rates by neglecting content moderation filters, since nearly all evaluated jailbreak techniques are detectable by at least one safety filter
Identifies precision-recall trade-offs in current safety filters and calls for further refinement of detection accuracy and usability

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm

Threat Tags

black_boxinference_time

Applications

llm safety systemscontent moderation filterschatbots

Read PDF arXiv DOI

Jailbreaking Attacks vs. Content Safety Filters: How Far Are We in the LLM Safety Arms Race?

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

Beyond Fixed and Dynamic Prompts: Embedded Jailbreak Templates for Advancing LLM Security

RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models

Quantifying CBRN Risk in Frontier Models

Read the Scene, Not the Script: Outcome-Aware Safety for LLMs

Understanding LLM Behavior When Encountering User-Supplied Harmful Content in Harmless Tasks

When Your Reviewer is an LLM: Biases, Divergence, and Prompt Injection Risks in Peer Review

MalURLBench: A Benchmark Evaluating Agents' Vulnerabilities When Processing Web URLs

SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests