
LSSF: Safety Alignment for Large Language Models through Low-Rank Safety Subspace Fusion

Guanghao Zhou 1,2, Panjia Qiu 1,2, Cen Chen 1, Hongyu Li 2, Mingyuan Chu 2, Xin Zhang 2, Jun Zhou 2


Published on arXiv · 2602.00038

Transfer Learning Attack

OWASP ML Top 10 — ML07

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

LSSF post-hoc re-alignment restores safety of fine-tuned Qwen2.5-7B-Instruct and Llama3.1-8B-Instruct with minimal impact on downstream task performance, without requiring additional fine-tuning.

LSSF (Low-Rank Safety Subspace Fusion)

Novel technique introduced


The safety mechanisms of large language models (LLMs) are notably fragile: even fine-tuning on datasets that contain no harmful content can undermine their safety capabilities. Meanwhile, existing safety alignment methods rely predominantly on fine-tuning, which adds complexity and computational cost. To address these issues, we introduce LSSF, a novel safety re-alignment framework built on Low-Rank Safety Subspace Fusion. Our method exploits the low-rank structure of safety information in LLMs by constructing a low-rank projection matrix that extracts the principal components of safety vectors. Notably, this projection matrix represents the low-rank safety subspace of the LLM, which we observe remains stable during fine-tuning and is isolated from the model's general capabilities. The extracted principal components restore safety alignment when combined with fine-tuned LLMs through linear arithmetic. Additionally, because the encoding density of safety information varies across layers, we propose a novel metric, safety singular value entropy, which quantifies this density and enables dynamic computation of the safety-critical rank for each safety vector. Extensive experiments demonstrate that our post-hoc alignment method effectively restores the safety alignment of fine-tuned models with minimal impact on downstream task performance.
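A minimal sketch of how a "safety singular value entropy" metric could drive per-layer rank selection. The exact formula is not specified in this summary; the code below assumes Shannon entropy over the normalized squared singular values of a layer's safety vector, with an effective-rank heuristic (`exp(entropy)`) to pick the safety-critical rank. Both choices are illustrative assumptions, not the paper's definition.

```python
import math

def safety_singular_value_entropy(singular_values):
    """Entropy of the (squared, normalized) singular-value spectrum.

    Low entropy -> safety information concentrated in a few directions
    (dense, low-rank encoding); high entropy -> spread across many.
    """
    total = sum(s * s for s in singular_values)
    probs = [s * s / total for s in singular_values]
    return -sum(p * math.log(p) for p in probs if p > 0)

def safety_critical_rank(singular_values):
    """Effective-rank heuristic: exp(entropy), rounded up and capped.

    This lets layers with flatter spectra keep more components than
    layers whose safety information collapses onto one direction.
    """
    h = safety_singular_value_entropy(singular_values)
    return min(len(singular_values), math.ceil(math.exp(h)))
```

A flat spectrum such as `[1, 1, 1, 1]` has entropy `log 4` and keeps all four components, while a spectrum dominated by one large singular value yields a much smaller rank.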


Key Contributions

  • Low-rank projection matrix construction to extract principal components of safety vectors for post-hoc re-alignment via linear arithmetic with fine-tuned models
  • Safety singular value entropy metric that quantifies safety information encoding density per layer to dynamically determine the safety-critical rank for each vector
  • Post-hoc alignment framework (LSSF) that restores safety of fine-tuned LLMs without requiring additional training, with minimal downstream task performance degradation
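The fusion step described above can be sketched as follows, under two assumptions not confirmed by this summary: that the per-layer safety vector is the weight delta between an aligned model and its base, and that re-alignment simply adds the top-`rank` SVD component of that delta back into the fine-tuned weights via linear arithmetic (with a hypothetical scaling factor `alpha`). The paper's exact projection and fusion rules may differ.

```python
import numpy as np

def low_rank_safety_component(delta_safe, rank):
    """Keep only the top-`rank` principal components of a safety vector.

    `delta_safe` is assumed to be W_aligned - W_base for one layer;
    the truncated SVD spans the low-rank safety subspace.
    """
    u, s, vt = np.linalg.svd(delta_safe, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]

def fuse(w_finetuned, delta_safe, rank, alpha=1.0):
    """Post-hoc re-alignment by linear arithmetic: add the low-rank
    safety component into the fine-tuned weights, no retraining needed."""
    return w_finetuned + alpha * low_rank_safety_component(delta_safe, rank)
```

Because only a truncated SVD per layer and one matrix addition are required, the restoration cost is negligible next to any fine-tuning run, which matches the framework's "no additional training" claim.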

🛡️ Threat Analysis

Transfer Learning Attack

The paper's primary threat model is that fine-tuning (transfer learning) degrades LLM safety alignment — even on benign data. LSSF is a defense that specifically targets the safety drift introduced during fine-tuning, restoring alignment by projecting back into a pre-identified low-rank safety subspace. This directly addresses the fine-tuning vulnerability angle of transfer learning attacks.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time
Datasets
evaluated on Qwen2.5-7B-Instruct and Llama3.1-8B-Instruct (models); downstream benchmarks unspecified
Applications
llm safety alignment, fine-tuned language models