Zihan Wang

h-index: 6 · 463 citations · 35 papers (total)

Papers in Database (2)

defense arXiv Jan 8, 2026 · 12w ago

Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks

Hoagy Cunningham, Jerry Wei, Zihan Wang et al. · Anthropic

Defends LLMs against universal jailbreaks using cascaded exchange classifiers and linear probes, reducing costs 40x with near-zero refusal rate

Prompt Injection · nlp
6 citations
defense arXiv Feb 11, 2026 · 7w ago

Mitigating Gradient Inversion Risks in Language Models via Token Obfuscation

Xinguo Feng, Zhongkui Ma, Zihan Wang et al. · The University of Queensland · CSIRO’s Data61 +1 more

Defends collaborative LLM training against gradient inversion by replacing tokens with semantically disconnected yet embedding-proximate shadow substitutes

Model Inversion Attack · Sensitive Information Disclosure · nlp · federated-learning