Understanding and Preserving Safety in Fine-Tuned LLMs
Jiawen Zhang 1, Yangfan Hu 2, Kejia Chen 1, Lipeng He 3, Jiachen Ma 4, Jian Lou 5, Dan Li 5, Jian Liu 1, Xiaohu Yang 1, Ruoxi Jia 6
Published on arXiv
arXiv:2601.10141
Transfer Learning Attack
OWASP ML Top 10 — ML07
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
SPF recovers nearly all pre-trained safety alignment even under adversarial fine-tuning while maintaining downstream task performance on models including Llama, Mistral, and Qwen.
SPF (Safety-Preserving Fine-Tuning)
Novel technique introduced
Fine-tuning is an essential and pervasive functionality for applying large language models (LLMs) to downstream tasks. However, it can substantially degrade safety alignment, e.g., by greatly increasing susceptibility to jailbreak attacks, even when the fine-tuning data is entirely harmless. Despite growing attention to defenses applied during the fine-tuning stage, existing methods struggle with a persistent safety-utility dilemma: emphasizing safety compromises task performance, whereas prioritizing utility typically requires deep fine-tuning that inevitably leads to steep declines in safety. In this work, we address this dilemma by shedding new light on the geometric interaction between safety- and utility-oriented gradients in safety-aligned LLMs. Through systematic empirical analysis, we uncover three key insights: (I) safety gradients lie in a low-rank subspace, while utility gradients span a broader high-dimensional space; (II) these subspaces are often negatively correlated, causing directional conflicts during fine-tuning; and (III) the dominant safety direction can be efficiently estimated from a single sample. Building on these insights, we propose safety-preserving fine-tuning (SPF), a lightweight approach that explicitly removes gradient components conflicting with the low-rank safety subspace. Theoretically, we show that SPF guarantees utility convergence while bounding safety drift. Empirically, SPF consistently maintains downstream task performance and recovers nearly all pre-trained safety alignment, even under adversarial fine-tuning scenarios. Furthermore, SPF exhibits robust resistance to both deep fine-tuning and dynamic jailbreak attacks. Together, our findings provide new mechanistic understanding and practical guidance toward always-aligned LLM fine-tuning.
Key Contributions
- Three geometric insights about safety vs. utility gradient subspaces in aligned LLMs: safety gradients are low-rank, subspaces are negatively correlated, and the dominant safety direction can be estimated from a single sample.
- Safety-Preserving Fine-Tuning (SPF): a lightweight method that removes utility gradient components conflicting with the low-rank safety subspace, with theoretical convergence and safety-drift bounds.
- Empirical demonstration that SPF recovers nearly all pre-trained safety alignment under both benign and adversarial fine-tuning scenarios while maintaining downstream task performance.
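The core mechanism, removing utility-gradient components that conflict with a low-rank safety subspace, can be sketched as a projection step. This is a hypothetical NumPy illustration, not the paper's implementation: `spf_step` and `safety_basis` are names introduced here, and the safety subspace is stood in for by one or more orthonormal directions (per insight III, a single sample's normalized safety gradient could serve as the rank-1 case).

```python
import numpy as np

def spf_step(g_util, safety_basis):
    """Sketch of an SPF-style gradient filter (assumed form, not the paper's code).

    g_util       : flattened utility (task) gradient, shape (d,)
    safety_basis : list of orthonormal directions spanning the low-rank
                   safety subspace, each of shape (d,)

    A descent step along -g_util raises the safety loss whenever the
    component of g_util along a safety direction u is negative
    (g_util @ u < 0). Such conflicting components are projected out;
    aligned components are left untouched.
    """
    g = g_util.astype(float).copy()
    for u in safety_basis:
        coef = g @ u
        if coef < 0:          # directional conflict with the safety subspace
            g -= coef * u     # remove only the conflicting component
    return g

# Rank-1 example: safety direction estimated from a single sample's gradient.
safety_dir = np.array([0.0, 1.0])            # already unit-norm
g_conflict = np.array([1.0, -1.0])           # opposes the safety direction
g_aligned  = np.array([1.0,  2.0])           # compatible with it

print(spf_step(g_conflict, [safety_dir]))    # conflicting component removed
print(spf_step(g_aligned,  [safety_dir]))    # unchanged
```

The conditional projection is PCGrad-like: it only intervenes when the inner product is negative, which is what lets the method keep downstream utility while bounding safety drift.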
🛡️ Threat Analysis
The paper's primary contribution directly targets the fine-tuning/transfer learning process: it analyzes how fine-tuning degrades safety alignment and proposes SPF as a defense that explicitly operates on gradient subspaces during fine-tuning to prevent safety drift. This is squarely about securing the transfer learning process.