Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety
Max Zhang, Derek Liu, Kai Zhang, Joshua Franco, Haihao Liu
Published on arXiv
2602.11157
Transfer Learning Attack
OWASP ML Top 10 — ML07
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Standard response-based KD on the teacher's safe refusals inadvertently increases the jailbreak success rate of all student models (by up to 16.6 pp), with the safety degradation primarily attributable to nuanced 'boundary' refusal data.
Response-Based Knowledge Distillation with LoRA PEFT
Novel technique introduced
Large language models (LLMs) are increasingly deployed worldwide, yet their safety alignment remains predominantly English-centric, leaving vulnerabilities in non-English contexts, especially for low-resource languages. We introduce a novel application of knowledge distillation (KD) to multilingual jailbreak prevention and examine its efficacy. Using ~28,000 multilingual jailbreak prompts from XSafety, we distill the refusal behaviors of a proprietary teacher model (OpenAI o1-mini) into three open-source student models (Meta-Llama-3-8B-Instruct, Gemma-2-2B-IT, and Qwen3-8B) via black-box, response-based, parameter-efficient fine-tuning (PEFT) with Low-Rank Adaptation (LoRA). Evaluation on the MultiJail benchmark reveals a counterintuitive result: standard fine-tuning on the teacher's "safe" refusal data inadvertently increases the Jailbreak Success Rate (JSR) of all student models, by up to 16.6 percentage points. Our experiments also reveal divergent generalization to unseen languages during distillation, with outcomes varying by base model. Removing a primary source of safety degradation, nuanced 'boundary' refusals, mitigates or even reverses the safety decline in student models, although reductions in reasoning performance (GSM8K) persist. Overall, our exploratory study highlights both the challenges and the potential of KD for multilingual safety alignment, offering a foundation for future research in this direction.
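The LoRA-based PEFT step described above amounts to learning a low-rank additive update to frozen base weights, W' = W + (alpha/r) * B A. A minimal NumPy sketch of that update follows; the dimensions, scaling convention, and zero-initialization of B are standard LoRA practice, not the paper's exact configuration:

```python
import numpy as np

def lora_update(W, A, B, alpha=16, r=8):
    """Apply a LoRA low-rank update: W' = W + (alpha / r) * (B @ A).

    W: (d_out, d_in) frozen base weight (not trained during PEFT)
    A: (r, d_in) trainable down-projection
    B: (d_out, r) trainable up-projection (initialized to zero)
    """
    return W + (alpha / r) * (B @ A)

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 32, 8
W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))  # standard LoRA init, so W' == W before training

adapted = lora_update(W, A, B, alpha=16, r=r)
# At initialization the adapted weight equals the frozen base weight,
# so the student's behavior only drifts as B is trained on teacher refusals.
assert np.allclose(adapted, W)
```

Because only A and B (rank r) are trained, the distillation touches a small fraction of the student's parameters, which is what makes the observed safety degradation notable: a low-rank update suffices to shift refusal behavior.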
Key Contributions
- First empirical study applying response-based knowledge distillation with LoRA PEFT as a multilingual jailbreak defense, revealing that it paradoxically increases the jailbreak success rate by up to 16.6 percentage points
- Failure analysis attributing safety degradation to three factors: nuanced 'boundary' refusals, amplification of teacher vulnerabilities, and catastrophic forgetting
- Preliminary data purification experiment showing that removing boundary refusals mitigates or reverses safety degradation in two of three student models
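The purification experiment can be sketched as a filter over the distillation corpus that drops nuanced 'boundary' refusals before fine-tuning. The record schema and the `refusal_type` label below are hypothetical stand-ins for the paper's refusal categorization, shown only to illustrate the filtering step:

```python
def purify(records):
    """Remove nuanced 'boundary' refusals from a distillation corpus.

    Each record is a dict with a hypothetical 'refusal_type' field:
    'hard' for an unambiguous refusal, 'boundary' for a hedged or
    partial refusal of the kind the paper links to safety degradation.
    """
    return [r for r in records if r.get("refusal_type") != "boundary"]

corpus = [
    {"prompt": "...", "response": "I can't help with that.", "refusal_type": "hard"},
    {"prompt": "...", "response": "In general terms, one might...", "refusal_type": "boundary"},
]

clean = purify(corpus)
# Only the unambiguous refusal survives into the purified training set.
```

Training on `clean` rather than `corpus` corresponds to the setting in which the paper reports mitigated or reversed safety degradation for two of the three student models.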
🛡️ Threat Analysis
The paper specifically studies how the transfer learning process (response-based KD with LoRA PEFT) affects safety alignment, finding that fine-tuning on safe teacher refusals inadvertently amplifies jailbreak vulnerability. This is directly relevant to how fine-tuning and adapter tuning can undermine model safety, an ML07 phenomenon.