
Stop Testing Attacks, Start Diagnosing Defenses: The Four-Checkpoint Framework Reveals Where LLM Safety Breaks

Hayfa Dhahbi, Kashyap Thimmaraju

0 citations · 27 references · arXiv (Cornell University)


Published on arXiv · 2602.09629

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

WASR reveals 52.7% overall vulnerability across three frontier LLMs — 2.3× higher than binary ASR (22.6%) — with output-stage checkpoints (CP3/CP4) most exploitable at 72–79% WASR.

Four-Checkpoint Framework / WASR

Novel technique introduced


Large Language Models (LLMs) deploy safety mechanisms to prevent harmful outputs, yet these defenses remain vulnerable to adversarial prompts. While existing research demonstrates that jailbreak attacks succeed, it does not explain where defenses fail or why. To address this gap, we propose that LLM safety operates as a sequential pipeline with distinct checkpoints. We introduce the Four-Checkpoint Framework, which organizes safety mechanisms along two dimensions: processing stage (input vs. output) and detection level (literal vs. intent). This creates four checkpoints, CP1 through CP4, each representing a defensive layer that can be independently evaluated. We design 13 evasion techniques, each targeting a specific checkpoint, enabling controlled testing of individual defensive layers. Using this framework, we evaluate GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro across 3,312 single-turn, black-box test cases. We employ an LLM-as-judge approach for response classification and introduce Weighted Attack Success Rate (WASR), a severity-adjusted metric that captures partial information leakage overlooked by binary evaluation. Our evaluation reveals clear patterns. Traditional binary ASR reports 22.6% attack success, but WASR reveals 52.7%, a 2.3× higher vulnerability. Output-stage defenses (CP3, CP4) prove weakest at 72–79% WASR, while input-literal defenses (CP1) are strongest at 13% WASR. Claude achieves the strongest safety (42.8% WASR), followed by GPT-5 (55.9%) and Gemini (59.5%). These findings suggest that current defenses are strongest at input-literal checkpoints but remain vulnerable to intent-level manipulation and output-stage techniques. The Four-Checkpoint Framework provides a structured approach for identifying and addressing safety vulnerabilities in deployed systems.
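The 2×2 structure of the framework can be sketched as a small enumeration. Note that only CP1 (input-literal) and CP3/CP4 (output-stage) placements are stated explicitly in the abstract, so the exact CP2/CP3 assignment below is an assumption:

```python
from itertools import product

# Two dimensions named in the abstract:
# processing stage (input vs. output) x detection level (literal vs. intent).
stages = ["input", "output"]
levels = ["literal", "intent"]

# CP numbering: CP1 = input-literal and CP3/CP4 = output-stage follow the
# abstract; the ordering of the remaining cells is an assumption.
checkpoints = {
    f"CP{i + 1}": (stage, level)
    for i, (stage, level) in enumerate(product(stages, levels))
}

print(checkpoints)
```

Each cell of the grid is a defensive layer that the paper's 13 evasion techniques target individually.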


Key Contributions

  • Four-Checkpoint Framework organizing LLM safety along two dimensions (processing stage: input/output × detection level: literal/intent), enabling independent evaluation of each defensive layer
  • Weighted Attack Success Rate (WASR), a severity-adjusted metric that captures partial information leakage invisible to binary ASR — revealing 52.7% vulnerability vs. 22.6% under binary evaluation (2.3× gap)
  • Empirical evaluation of GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro across 3,312 black-box test cases, showing output-stage defenses (CP3/CP4) are weakest (72–79% WASR) while input-literal defenses (CP1) are strongest (13% WASR)
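To illustrate why a severity-adjusted metric reports higher vulnerability than binary ASR, here is a minimal sketch. The 0 / 0.5 / 1.0 severity scale and the judge scores are illustrative assumptions, not the paper's actual rubric:

```python
# Hedged sketch: binary ASR counts only full attack successes, while a
# severity-adjusted WASR also credits partial information leakage.

def binary_asr(scores, threshold=1.0):
    """Fraction of cases counted as full successes (severity >= threshold)."""
    return sum(s >= threshold for s in scores) / len(scores)

def wasr(scores):
    """Mean severity in [0, 1]; partial leakage contributes fractionally."""
    return sum(scores) / len(scores)

# Hypothetical LLM-judge severities for 8 test cases:
# 0.0 = refusal, 0.5 = partial leakage, 1.0 = full compliance.
scores = [0.0, 0.5, 1.0, 0.5, 0.0, 1.0, 0.5, 0.0]

print(f"Binary ASR: {binary_asr(scores):.3f}")  # 2/8 = 0.250
print(f"WASR:       {wasr(scores):.3f}")        # 3.5/8 = 0.438
```

Because the intermediate 0.5 responses are invisible to the binary metric, WASR exceeds binary ASR whenever partial leakage occurs, which is the gap (52.7% vs. 22.6%) the paper highlights.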

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, inference_time
Datasets
custom: 3,312 single-turn test cases (GPT-5, Claude Sonnet 4, Gemini 2.5 Pro)
Applications
llm chatbots, conversational ai safety