Zeming Wei

Papers in Database (2)

defense · arXiv · Aug 21, 2025

Reliable Unlearning Harmful Information in LLMs with Metamorphosis Representation Projection

Chengcan Wu, Zeming Wei, Huanran Chen et al. · Peking University · Tsinghua University

Proposes irreversible hidden-state projections in LLMs to permanently erase harmful knowledge and resist adversarial relearning attacks

Transfer Learning · Attack · Prompt Injection · NLP
PDF · Code
benchmark · arXiv · Sep 4, 2025

False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize

Cheng Wang, Zeming Wei, Qin Liu et al. · National University of Singapore · Peking University +1 more

Probing-based LLM safety detectors learn surface patterns rather than semantic harm, and fail badly on out-of-distribution malicious inputs

Prompt Injection · NLP
PDF · Code