Ziwei Xu

h-index: 3 · 29 citations · 10 papers (total)

Papers in Database (1)

attack · arXiv · Jan 27, 2026

LLMs Can Unlearn Refusal with Only 1,000 Benign Samples

Yangyang Guo, Ziwei Xu, Si Liu et al. · National University of Singapore · Beihang University

Fine-tunes LLMs on 1,000 benign samples with refusal prefixes to erase safety alignment across 16 models including GPT and Gemini

Transfer Learning Attack · Prompt Injection · nlp