
Multi-Level Safety Continual Projection for Fine-Tuned Large Language Models without Retraining

Bing Han 1,2, Feifei Zhao 2,3,4, Dongcheng Zhao 2,3,4, Guobin Shen 2,1, Ping Wu 2,1, Yu Shi 1, Yi Zeng 2,3,1,4

Published on arXiv: 2508.09190

Transfer Learning Attack (OWASP ML Top 10 — ML07)

Prompt Injection (OWASP LLM Top 10 — LLM01)

Key Finding

MSCP reduces harmfulness scores close to the minimum of 1 and lowers jailbreak attack success rates with as few as 4.67% parameter modifications on Qwen while preserving task utility.

MSCP (Multi-Level Safety Continual Projection)

Novel technique introduced


While fine-tuning services drive the rapid expansion of task capabilities in large language models (LLMs), they are often accompanied by the degradation and reorganization of safety-aligned representations, making models more prone to deviating from human preferences and exposing them to emerging jailbreak risks. Existing post-fine-tuning defense methods predominantly rely on single-scale safety correction mechanisms, which struggle to achieve a robust balance among safety, model utility, and continual adaptability. We propose Multi-Level Safety Continual Projection (MSCP), a training-free post-fine-tuning safety enhancement method that implicitly aligns global and localized safety activations through coordinated multi-level representations to isolate sparse neuron clusters governing safety-sensitive behaviors. It then applies composable safety-direction projections without retraining, effectively suppressing harmful outputs under minimal parameter perturbations while preserving task performance and improving alignment with human preferences. Extensive experiments across multiple fine-tuned LLMs demonstrate that our method significantly reduces harmfulness scores and attack success rates with minimal parameter modifications while preserving the model's utility. Furthermore, we introduce a task-specific, multi-dimensional heterogeneous safety activation clustering mechanism that enables continual defense and generalization against unforeseen emerging safety concerns.


Key Contributions

  • Multi-Level Safety Continual Projection (MSCP): a training-free post-fine-tuning method that aligns global and localized safety activations to isolate sparse safety-critical neuron clusters and applies composable safety-direction projections without retraining.
  • Achieves near-minimum harmfulness scores with minimal parameter modifications (e.g., 4.67% for Qwen) while preserving downstream task utility and improving human preference alignment.
  • Task-specific, multi-dimensional heterogeneous safety activation clustering mechanism that enables continual defense against newly emerging safety dimensions (e.g., only 0.75% additional parameters for terrorism dimension).
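The core operation behind safety-direction projections can be illustrated with a minimal sketch. The snippet below assumes a simple difference-of-means direction between harmful and benign prompt activations and a single-direction orthogonal projection of a weight matrix; the paper's actual multi-level clustering and composable projections are more involved, so treat the function names and the estimation step as illustrative, not as MSCP's implementation.

```python
import numpy as np

def safety_direction(harmful_acts, benign_acts):
    """Difference-of-means direction separating harmful from benign
    activations (a common proxy; MSCP's multi-level clustering is
    more elaborate). Rows are per-prompt hidden activations."""
    d = harmful_acts.mean(axis=0) - benign_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def project_out(W, v):
    """Ablate direction v from W via W <- (I - v v^T) W, so the
    model can no longer write along v while other directions are
    untouched -- a training-free, minimal parameter perturbation."""
    v = v.reshape(-1, 1)
    return W - v @ (v.T @ W)

# Toy data: synthetic activations with a shift along one direction.
rng = np.random.default_rng(0)
hidden = 8
harmful = rng.normal(1.0, 0.1, size=(16, hidden))
benign = rng.normal(0.0, 0.1, size=(16, hidden))

v = safety_direction(harmful, benign)
W = rng.normal(size=(hidden, hidden))
W_proj = project_out(W, v)

# After projection, W has (near-)zero response along v.
print(np.abs(v @ W_proj).max())  # close to 0
```

Because the projection touches only the rank-1 component along `v`, the fraction of effectively modified parameters stays small, which matches the paper's emphasis on minimal parameter perturbation.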

🛡️ Threat Analysis

Transfer Learning Attack

The paper's core threat model is that fine-tuning (transfer learning) degrades safety alignment representations even on benign data, creating exploitable vulnerabilities — MSCP is a post-fine-tuning defense specifically addressing safety degradation through the fine-tuning/transfer process.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, training_time, inference_time
Applications
fine-tuned llms, semantic qa, mathematical reasoning