Kai-Wei Chang

h-index: 8 204 citations 20 papers (total)

Papers in Database (1)

defense arXiv Nov 18, 2025 · Nov 2025

From Narrow Unlearning to Emergent Misalignment: Causes, Consequences, and Containment in LLMs

Erum Mushtaq, Anil Ramakrishna, Satyapriya Krishna et al. · University of Southern California · Amazon AGI

Reveals that narrow refusal unlearning on LLMs triggers emergent misalignment in unrelated safety domains, and proposes a retain-data defense to contain it

Transfer Learning Attack Prompt Injection nlp
3 citations PDF