
Unified Defense for Large Language Models against Jailbreak and Fine-Tuning Attacks in Education

Xin Yi, Yue Li, Dongsheng Shi, Linlin Wang, Xiaoling Wang, Liang He

1 citation · 77 references · arXiv

Published on arXiv · 2511.14423

Transfer Learning Attack

OWASP ML Top 10 — ML07

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

TSSF defends against all 8 tested jailbreak strategies while preserving utility for benign queries, and robustly resists fine-tuning-based safety removal attacks across three benchmark datasets.

TSSF (Three-Stage Shield Framework)

Novel technique introduced


Large Language Models (LLMs) are increasingly integrated into educational applications. However, they remain vulnerable to jailbreak and fine-tuning attacks, which can compromise safety alignment and lead to harmful outputs. Existing studies mainly focus on general safety evaluations, with limited attention to the unique safety requirements of educational scenarios. To address this gap, we construct EduHarm, a benchmark containing safe-unsafe instruction pairs across five representative educational scenarios, enabling systematic safety evaluation of educational LLMs. Furthermore, we propose a three-stage shield framework (TSSF) for educational LLMs that simultaneously mitigates both jailbreak and fine-tuning attacks. First, safety-aware attention realignment redirects attention toward critical unsafe tokens, thereby restoring the harmfulness feature that discriminates between unsafe and safe inputs. Second, layer-wise safety judgment identifies harmfulness features by aggregating safety cues across multiple layers to detect unsafe instructions. Finally, defense-driven dual routing separates safe and unsafe queries, ensuring normal processing for benign inputs and guarded responses for harmful ones. Extensive experiments across eight jailbreak attack strategies demonstrate that TSSF effectively strengthens safety while preventing over-refusal of benign queries. Evaluations on three fine-tuning attack datasets further show that it consistently achieves robust defense against harmful queries while preserving utility gains from benign fine-tuning.


Key Contributions

  • EduHarm benchmark: safe–unsafe instruction pairs across five educational scenarios (teaching, learning, administration, assessment, research) for systematic safety evaluation of educational LLMs
  • TSSF (Three-Stage Shield Framework): safety-aware attention realignment that restores harmfulness features, layer-wise safety judgment aggregating multi-layer cues, and defense-driven dual routing for safe/unsafe query separation
  • Unified defense against both jailbreak and fine-tuning attacks simultaneously, evaluated across 8 jailbreak strategies and 3 fine-tuning attack datasets
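The second and third TSSF stages (layer-wise safety judgment and dual routing) can be illustrated with a minimal sketch. Everything below is an illustrative assumption, not the paper's implementation: the per-layer linear probes, the mean-vote aggregation, and the threshold are stand-ins for whatever the authors actually train; the sketch only shows the control flow of aggregating multi-layer safety cues and routing queries accordingly.

```python
# Hypothetical sketch of TSSF-style layer-wise safety judgment and
# defense-driven dual routing. Probe weights, aggregation rule, and
# threshold are illustrative assumptions, not the paper's method.

def layer_safety_scores(hidden_states, probes):
    """Score each layer's hidden state with a linear 'harmfulness' probe.

    hidden_states: per-layer feature vectors (conceptually, after
    safety-aware attention realignment has restored the harmfulness
    feature). probes: one weight vector per layer.
    """
    return [
        sum(h * w for h, w in zip(state, probe))
        for state, probe in zip(hidden_states, probes)
    ]

def aggregate_judgment(scores, threshold=0.0):
    """Aggregate safety cues across layers (here: a simple mean vote).
    Returns True when the query is judged unsafe."""
    return sum(scores) / len(scores) > threshold

def dual_route(query, hidden_states, probes, generate, refuse):
    """Dual routing: benign queries proceed to normal generation,
    flagged queries receive a guarded response."""
    scores = layer_safety_scores(hidden_states, probes)
    return refuse(query) if aggregate_judgment(scores) else generate(query)

# Toy usage with two layers and two-dimensional features.
probes = [[1.0, -1.0], [0.5, 0.5]]
unsafe_states = [[2.0, 0.0], [1.0, 1.0]]    # mean score 1.5  -> unsafe
safe_states = [[0.0, 2.0], [-1.0, -1.0]]    # mean score -1.5 -> safe
print(dual_route("q", unsafe_states, probes,
                 lambda q: "answer", lambda q: "guarded"))  # guarded
print(dual_route("q", safe_states, probes,
                 lambda q: "answer", lambda q: "guarded"))  # answer
```

The design point the sketch captures is that the routing decision is pooled over several layers rather than read from a single one, which is what lets the judgment survive fine-tuning that perturbs any individual layer's features.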

🛡️ Threat Analysis

Transfer Learning Attack

The paper explicitly defends against fine-tuning attacks, in which adversaries fine-tune safety-aligned LLMs on harmful data to strip away safety alignment. This directly exploits the transfer learning/fine-tuning process, the core of ML07. TSSF is evaluated for robustness on three fine-tuning attack datasets (e.g., Qi et al.'s fine-tuning alignment bypass).


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, training_time, black_box
Datasets
EduHarm, ToxicChat, SALAD-Bench, BeaverTails
Applications
educational llms, intelligent tutoring systems, automated grading, personalized learning assistants