Multi-Level Safety Continual Projection for Fine-Tuned Large Language Models without Retraining
Bing Han 1,2, Feifei Zhao 2,3,4, Dongcheng Zhao 2,3,4, Guobin Shen 2,1, Ping Wu 2,1, Yu Shi 1, Yi Zeng 2,3,1,4
Published on arXiv (2508.09190)
Transfer Learning Attack
OWASP ML Top 10 — ML07
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
MSCP reduces harmfulness scores to near the minimum of 1 and lowers jailbreak attack success rates while modifying as little as 4.67% of parameters on Qwen and preserving task utility.
MSCP (Multi-Level Safety Continual Projection)
Novel technique introduced
While fine-tuning services drive the rapid expansion of task capabilities in large language models (LLMs), they are often accompanied by the degradation and reorganization of safety-aligned representations, making models more prone to deviating from human preferences and exposing them to emerging jailbreak risks. Existing post-fine-tuning defense methods predominantly rely on single-scale safety correction mechanisms, which struggle to achieve a robust balance among safety, model utility, and continual adaptability. We propose Multi-Level Safety Continual Projection (MSCP), a training-free post-fine-tuning safety enhancement method that implicitly aligns global and localized safety activations through coordinated multi-level representations to isolate sparse neuron clusters governing safety-sensitive behaviors. It then applies composable safety-direction projections without retraining, effectively suppressing harmful outputs under minimal parameter perturbations while preserving task performance and improving alignment with human preferences. Extensive experiments across multiple fine-tuned LLMs demonstrate that our method significantly reduces harmfulness scores and attack success rates with minimal parameter modifications while preserving model utility. Furthermore, we introduce a task-specific, multi-dimensional heterogeneous safety activation clustering mechanism that enables continual defense and generalization against unforeseen emerging safety concerns.
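To make the core idea of a "safety-direction projection" concrete, here is a minimal numpy sketch of the general technique the abstract describes: estimate a safety-relevant direction from the difference between activations on harmful and benign prompts, then edit a weight matrix so its outputs no longer carry that direction. This is an illustration of the family of methods, not the paper's actual implementation; the function names, the difference-of-means estimator, and the per-matrix projection are all assumptions.

```python
import numpy as np

def safety_direction(harmful_acts, benign_acts):
    """Estimate a safety-sensitive direction as the (assumed) difference of
    mean activations on harmful vs. benign prompts, normalized to unit length."""
    v = harmful_acts.mean(axis=0) - benign_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def project_out(W, v):
    """Edit weight matrix W so its outputs are orthogonal to direction v:
    W' = (I - v v^T) W. This is a retraining-free, in-place weight edit."""
    P = np.eye(len(v)) - np.outer(v, v)
    return P @ W

# Toy usage with random "activations" standing in for hidden states.
rng = np.random.default_rng(0)
harmful = rng.normal(1.0, 0.1, size=(32, 8))   # hypothetical harmful-prompt activations
benign = rng.normal(0.0, 0.1, size=(32, 8))    # hypothetical benign-prompt activations
v = safety_direction(harmful, benign)
W = rng.normal(size=(8, 8))                    # stand-in for one model weight matrix
W_safe = project_out(W, v)
# Any input now yields an output with zero component along v.
```

Because the edit touches only the weight matrices (and only their component along one direction), it perturbs a small fraction of parameters and leaves behavior in the orthogonal subspace, where task capability presumably lives, unchanged.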
Key Contributions
- Multi-Level Safety Continual Projection (MSCP): a training-free post-fine-tuning method that aligns global and localized safety activations to isolate sparse safety-critical neuron clusters and applies composable safety-direction projections without retraining.
- Achieves near-minimum harmfulness scores with minimal parameter modifications (e.g., 4.67% for Qwen) while preserving downstream task utility and improving human preference alignment.
- Task-specific, multi-dimensional heterogeneous safety activation clustering mechanism that enables continual defense against newly emerging safety dimensions (e.g., only 0.75% additional parameters for terrorism dimension).
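The "composable" and "continual" properties in the contributions above can be sketched as follows: maintain a single projector over all safety directions found so far, and fold in each newly discovered dimension (e.g., a terrorism direction) by projecting out only its residual component, so previously covered directions cost nothing extra. This is a hedged illustration of how such projections compose in general; the function names and the QR/Gram-Schmidt construction are assumptions, not the paper's method.

```python
import numpy as np

def compose_safety_projection(directions):
    """Build one projector that removes all given safety directions:
    orthonormalize them (QR), then form P = I - Q Q^T."""
    Q, _ = np.linalg.qr(np.stack(directions, axis=1))
    d = directions[0].shape[0]
    return np.eye(d) - Q @ Q.T

def add_safety_dimension(P, new_dir, tol=1e-8):
    """Continual update: keep only the part of new_dir not already removed
    by P, then fold that residual into the projector."""
    r = P @ new_dir
    n = np.linalg.norm(r)
    if n < tol:                 # direction already covered; P unchanged
        return P
    r = r / n
    return P - np.outer(r, r)   # rank-1 update: only the residual is new

# Toy usage: start with one safety dimension, then add an emerging one.
rng = np.random.default_rng(1)
v1 = rng.normal(size=6); v1 /= np.linalg.norm(v1)   # hypothetical existing dimension
v2 = rng.normal(size=6); v2 /= np.linalg.norm(v2)   # hypothetical new dimension
P = compose_safety_projection([v1])
P = add_safety_dimension(P, v2)                      # both directions now removed
```

The rank-1 update is what keeps the continual cost small: each new safety dimension adds at most one direction to the projector, consistent with the paper's report that extending defense to a new dimension required only a small fraction of additional parameters.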
🛡️ Threat Analysis
The paper's core threat model is that fine-tuning (transfer learning) degrades safety-aligned representations even on benign data, creating exploitable vulnerabilities. MSCP is a post-fine-tuning defense that specifically addresses this safety degradation introduced by the fine-tuning/transfer process.