Derek Liu

h-index: 4 · 56 citations · 7 papers (total)

Papers in Database (1)

defense · arXiv · Dec 8, 2025 · Dec 2025

Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety

Max Zhang, Derek Liu, Kai Zhang et al. · AlgoVerseAI Research

Knowledge distillation of safe refusal behaviors into LLMs counterintuitively increases multilingual jailbreak success by up to 16.6 points

Transfer Learning Attack · Prompt Injection · nlp