Hayfa Dhahbi

h-index: 0 · 0 citations · 1 paper (total)

Papers in Database (1)

benchmark · arXiv · Feb 10, 2026

Stop Testing Attacks, Start Diagnosing Defenses: The Four-Checkpoint Framework Reveals Where LLM Safety Breaks

Hayfa Dhahbi, Kashyap Thimmaraju · Technische Universität Berlin

Proposes the Four-Checkpoint Framework and the WASR metric to diagnose which LLM safety layers break under 13 prompt-level jailbreak techniques.

Prompt Injection nlp