Zhenhong Zhou

h-index: 3 · 42 citations · 13 papers (total)

Papers in Database (4)

defense · arXiv · Sep 29, 2025

DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models

Zherui Li, Zheng Nie, Zhenhong Zhou et al. · Beijing University of Posts and Telecommunications · National University of Singapore +5 more

Defends diffusion LLMs against jailbreaks by correcting greedy remasking bias and adding block-level autonomous safety repair

Prompt Injection nlp
3 citations · 2 influential · PDF · Code
defense · arXiv · Sep 26, 2025

Backdoor Attribution: Elucidating and Controlling Backdoor in Language Models

Miao Yu, Zhenhong Zhou, Moayad Aloqaily et al. · University of Science and Technology of China · Nanyang Technological University +5 more

Mechanistic interpretability framework identifies backdoor-responsible attention heads in LLMs, enabling surgical neutralization or amplification of backdoor behavior

Model Poisoning nlp
1 citation · PDF
benchmark · arXiv · Jan 2, 2026

CSSBench: Evaluating the Safety of Lightweight LLMs against Chinese-Specific Adversarial Patterns

Zhenhong Zhou, Shilinlu Yan, Chuanpu Liu et al. · Nanyang Technological University · Beijing University of Posts and Telecommunications +1 more

Benchmarks lightweight LLM safety against Chinese-specific jailbreak patterns such as homophone substitution, pinyin encoding, and symbol splitting

Prompt Injection nlp
PDF
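To make the three adversarial patterns named in the summary concrete, here is a minimal illustrative sketch. This is not CSSBench's implementation; the toy homophone and pinyin maps below are hypothetical stand-ins for the much larger mappings a real benchmark would use.

```python
# Toy, hand-written maps for illustration only; real coverage is far larger.
HOMOPHONES = {"杀": "煞"}            # same-pronunciation character substitution
PINYIN = {"杀": "sha", "人": "ren"}  # romanized (pinyin) encoding of characters


def homophone_swap(text: str) -> str:
    """Replace characters with same-sounding ones to dodge keyword filters."""
    return "".join(HOMOPHONES.get(ch, ch) for ch in text)


def pinyin_encode(text: str) -> str:
    """Rewrite characters as space-separated pinyin syllables."""
    return " ".join(PINYIN.get(ch, ch) for ch in text)


def symbol_split(text: str, sep: str = "*") -> str:
    """Insert symbols between characters to break substring matching."""
    return sep.join(text)


if __name__ == "__main__":
    word = "杀人"
    print(homophone_swap(word))  # 煞人
    print(pinyin_encode(word))   # sha ren
    print(symbol_split(word))    # 杀*人
```

Each transform preserves the phrase's meaning to a human reader while defeating naive keyword or substring filters, which is what makes these patterns Chinese-specific attack surfaces.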
defense · arXiv · Oct 11, 2025

Backdoor Collapse: Eliminating Unknown Threats via Known Backdoor Aggregation in Language Models

Liang Lin, Miao Yu, Moayad Aloqaily et al. · Nanyang Technological University · University of Science and Technology of China +4 more

Defends LLMs against unknown backdoors by intentionally injecting known triggers to aggregate and then purge backdoor representations

Model Poisoning nlp
PDF