Jerry Wei

h-index: 2 · 116 citations · 3 papers (total)

Papers in Database (1)

defense · arXiv · Jan 8, 2026 · 12w ago

Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks

Hoagy Cunningham, Jerry Wei, Zihan Wang et al. · Anthropic

Defends LLMs against universal jailbreaks using cascaded exchange classifiers and linear probes, reducing costs 40x with a near-zero refusal rate.

Prompt Injection · nlp
6 citations