Multi-Level Safety Continual Projection for Fine-Tuned Large Language Models without Retraining
Bing Han 1,2, Feifei Zhao 2,3,4, Dongcheng Zhao 2,3,4, Guobin Shen 2,1, Ping Wu 2,1, Yu Shi 1, Yi Zeng 2,3,1,4
Published on arXiv (2508.09190)
Transfer Learning Attack
OWASP ML Top 10 — ML07
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
MSCP reduces harmfulness scores to near the minimum of 1 and lowers jailbreak attack success rates while modifying as little as 4.67% of parameters on Qwen and preserving task utility.
MSCP (Multi-Level Safety Continual Projection)
Novel technique introduced
While fine-tuning services drive the rapid expansion of task capabilities in large language models (LLMs), they are often accompanied by the degradation and reorganization of safety-aligned representations, making models more prone to deviating from human preferences and exposing them to emerging jailbreak risks. Existing post-fine-tuning defense methods predominantly rely on single-scale safety correction mechanisms, which struggle to achieve a robust balance among safety, model utility, and continual adaptability. We propose Multi-Level Safety Continual Projection (MSCP), a training-free post-fine-tuning safety enhancement method that implicitly aligns global and localized safety activations through coordinated multi-level representations to isolate sparse neuron clusters governing safety-sensitive behaviors. It then applies composable safety-direction projections without retraining, effectively suppressing harmful outputs under minimal parameter perturbations while preserving task performance and improving alignment with human preferences. Extensive experiments across multiple fine-tuned LLMs demonstrate that our method significantly reduces harmfulness scores and attack success rates with minimal parameter modifications while preserving model utility. Furthermore, we introduce a task-specific, multi-dimensional heterogeneous safety activation clustering mechanism that enables continual defense and generalization against unforeseen emerging safety concerns.
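To make the core idea of a "safety-direction projection" concrete, here is a minimal numpy sketch of the general technique the abstract describes: estimate a safety-relevant direction from the difference between activations on harmful and benign prompts, then edit a weight matrix so its outputs no longer carry that direction. This is an illustration of the family of methods, not the paper's actual implementation; the function names, the difference-of-means estimator, and the per-matrix projection are all assumptions.

```python
import numpy as np

def safety_direction(harmful_acts, benign_acts):
    """Estimate a safety-sensitive direction as the (assumed) difference of
    mean activations on harmful vs. benign prompts, normalized to unit length."""
    v = harmful_acts.mean(axis=0) - benign_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def project_out(W, v):
    """Edit weight matrix W so its outputs are orthogonal to direction v:
    W' = (I - v v^T) W. This is a retraining-free, in-place weight edit."""
    P = np.eye(len(v)) - np.outer(v, v)
    return P @ W

# Toy usage with random "activations" standing in for hidden states.
rng = np.random.default_rng(0)
harmful = rng.normal(1.0, 0.1, size=(32, 8))   # hypothetical harmful-prompt activations
benign = rng.normal(0.0, 0.1, size=(32, 8))    # hypothetical benign-prompt activations
v = safety_direction(harmful, benign)
W = rng.normal(size=(8, 8))                    # stand-in for one model weight matrix
W_safe = project_out(W, v)
# Any input now yields an output with zero component along v.
```

Because the edit touches only the weight matrices (and only their component along one direction), it perturbs a small fraction of parameters and leaves behavior in the orthogonal subspace, where task capability presumably lives, unchanged.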
Key Contributions
- Multi-Level Safety Continual Projection (MSCP): a training-free post-fine-tuning method that aligns global and localized safety activations to isolate sparse safety-critical neuron clusters and applies composable safety-direction projections without retraining.
- Achieves near-minimum harmfulness scores with minimal parameter modifications (e.g., 4.67% for Qwen) while preserving downstream task utility and improving human preference alignment.
- Task-specific, multi-dimensional heterogeneous safety activation clustering mechanism that enables continual defense against newly emerging safety dimensions (e.g., only 0.75% additional parameters for terrorism dimension).
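The "composable" and "continual" properties in the contributions above can be sketched as follows: maintain a single projector over all safety directions found so far, and fold in each newly discovered dimension (e.g., a terrorism direction) by projecting out only its residual component, so previously covered directions cost nothing extra. This is a hedged illustration of how such projections compose in general; the function names and the QR/Gram-Schmidt construction are assumptions, not the paper's method.

```python
import numpy as np

def compose_safety_projection(directions):
    """Build one projector that removes all given safety directions:
    orthonormalize them (QR), then form P = I - Q Q^T."""
    Q, _ = np.linalg.qr(np.stack(directions, axis=1))
    d = directions[0].shape[0]
    return np.eye(d) - Q @ Q.T

def add_safety_dimension(P, new_dir, tol=1e-8):
    """Continual update: keep only the part of new_dir not already removed
    by P, then fold that residual into the projector."""
    r = P @ new_dir
    n = np.linalg.norm(r)
    if n < tol:                 # direction already covered; P unchanged
        return P
    r = r / n
    return P - np.outer(r, r)   # rank-1 update: only the residual is new

# Toy usage: start with one safety dimension, then add an emerging one.
rng = np.random.default_rng(1)
v1 = rng.normal(size=6); v1 /= np.linalg.norm(v1)   # hypothetical existing dimension
v2 = rng.normal(size=6); v2 /= np.linalg.norm(v2)   # hypothetical new dimension
P = compose_safety_projection([v1])
P = add_safety_dimension(P, v2)                      # both directions now removed
```

The rank-1 update is what keeps the continual cost small: each new safety dimension adds at most one direction to the projector, consistent with the paper's report that extending defense to a new dimension required only a small fraction of additional parameters.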
🛡️ Threat Analysis
The paper's core threat model is that fine-tuning (transfer learning) degrades safety-aligned representations even on benign data, creating exploitable vulnerabilities. MSCP is a post-fine-tuning defense that specifically addresses this safety degradation introduced by the fine-tuning/transfer process.