Hayfa Dhahbi

benchmark arXiv Feb 10, 2026

Hayfa Dhahbi, Kashyap Thimmaraju · Technische Universität Berlin

Proposes the Four-Checkpoint Framework and the WASR metric to diagnose which LLM safety layers break under 13 prompt-level jailbreak techniques

Prompt Injection nlp

Papers in Database (1)