
LSSF: Safety Alignment for Large Language Models through Low-Rank Safety Subspace Fusion

Guanghao Zhou 1,2, Panjia Qiu 1,2, Cen Chen 1, Hongyu Li 2, Mingyuan Chu 2, Xin Zhang 2, Jun Zhou 2


Published on arXiv · 2602.00038

Transfer Learning Attack

OWASP ML Top 10 — ML07

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

LSSF post-hoc re-alignment restores safety of fine-tuned Qwen2.5-7B-Instruct and Llama3.1-8B-Instruct with minimal impact on downstream task performance, without requiring additional fine-tuning.

LSSF (Low-Rank Safety Subspace Fusion)

Novel technique introduced


The safety mechanisms of large language models (LLMs) are notably fragile: even fine-tuning on datasets that contain no harmful content can undermine their safety capabilities. Meanwhile, existing safety alignment methods rely predominantly on fine-tuning, which adds complexity and computational cost. To address these issues, we introduce LSSF, a novel safety re-alignment framework built on Low-Rank Safety Subspace Fusion. Our method exploits the low-rank structure of safety information in LLMs by constructing a low-rank projection matrix that extracts the principal components of safety vectors. Notably, this projection matrix represents the low-rank safety subspace of the LLM, which we observe remains stable during fine-tuning and is isolated from the model's general capabilities. The extracted principal components restore safety alignment when combined with fine-tuned LLMs through linear arithmetic. Additionally, because the encoding density of safety information varies across layers, we propose a novel metric, safety singular value entropy, which quantifies this density and enables dynamic computation of the safety-critical rank for each safety vector. Extensive experiments demonstrate that our post-hoc alignment method effectively restores the safety alignment of fine-tuned models with minimal impact on downstream task performance.
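A minimal sketch of how a "safety singular value entropy" metric could drive per-layer rank selection. The exact formula is not specified in this summary; the code below assumes Shannon entropy over the normalized squared singular values of a layer's safety vector, with an effective-rank heuristic (`exp(entropy)`) to pick the safety-critical rank. Both choices are illustrative assumptions, not the paper's definition.

```python
import math

def safety_singular_value_entropy(singular_values):
    """Entropy of the (squared, normalized) singular-value spectrum.

    Low entropy -> safety information concentrated in a few directions
    (dense, low-rank encoding); high entropy -> spread across many.
    """
    total = sum(s * s for s in singular_values)
    probs = [s * s / total for s in singular_values]
    return -sum(p * math.log(p) for p in probs if p > 0)

def safety_critical_rank(singular_values):
    """Effective-rank heuristic: exp(entropy), rounded up and capped.

    This lets layers with flatter spectra keep more components than
    layers whose safety information collapses onto one direction.
    """
    h = safety_singular_value_entropy(singular_values)
    return min(len(singular_values), math.ceil(math.exp(h)))
```

A flat spectrum such as `[1, 1, 1, 1]` has entropy `log 4` and keeps all four components, while a spectrum dominated by one large singular value yields a much smaller rank.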


Key Contributions

  • Low-rank projection matrix construction to extract principal components of safety vectors for post-hoc re-alignment via linear arithmetic with fine-tuned models
  • Safety singular value entropy metric that quantifies safety information encoding density per layer to dynamically determine the safety-critical rank for each vector
  • Post-hoc alignment framework (LSSF) that restores safety of fine-tuned LLMs without requiring additional training, with minimal downstream task performance degradation
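The fusion step described above can be sketched as follows, under two assumptions not confirmed by this summary: that the per-layer safety vector is the weight delta between an aligned model and its base, and that re-alignment simply adds the top-`rank` SVD component of that delta back into the fine-tuned weights via linear arithmetic (with a hypothetical scaling factor `alpha`). The paper's exact projection and fusion rules may differ.

```python
import numpy as np

def low_rank_safety_component(delta_safe, rank):
    """Keep only the top-`rank` principal components of a safety vector.

    `delta_safe` is assumed to be W_aligned - W_base for one layer;
    the truncated SVD spans the low-rank safety subspace.
    """
    u, s, vt = np.linalg.svd(delta_safe, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]

def fuse(w_finetuned, delta_safe, rank, alpha=1.0):
    """Post-hoc re-alignment by linear arithmetic: add the low-rank
    safety component into the fine-tuned weights, no retraining needed."""
    return w_finetuned + alpha * low_rank_safety_component(delta_safe, rank)
```

Because only a truncated SVD per layer and one matrix addition are required, the restoration cost is negligible next to any fine-tuning run, which matches the framework's "no additional training" claim.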

🛡️ Threat Analysis

Transfer Learning Attack

The paper's primary threat model is that fine-tuning (transfer learning) degrades LLM safety alignment — even on benign data. LSSF is a defense that specifically targets the safety drift introduced during fine-tuning, restoring alignment by projecting back into a pre-identified low-rank safety subspace. This directly addresses the fine-tuning vulnerability angle of transfer learning attacks.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time
Datasets
evaluated on Qwen2.5-7B-Instruct and Llama3.1-8B-Instruct (models); downstream benchmarks unspecified
Applications
llm safety alignment, fine-tuned language models