
A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space

Bingjie Zhang 1, Yibo Yang 2,3, Zhe Ren 1, Dandan Guo 1, Jindong Gu 3, Philip Torr 3, Bernard Ghanem 2

3 citations · 73 references · arXiv


Published on arXiv: 2510.14301

Transfer Learning Attack (OWASP ML Top 10: ML07)

Prompt Injection (OWASP LLM Top 10: LLM01)

Key Finding

GuardSpace reduces the average harmful score of Llama-2-7B-Chat fine-tuned on GSM8K from 14.4% (state-of-the-art AsFT) to 3.6%, while improving math accuracy from 26.0% to 28.0%.

GuardSpace

Novel technique introduced


Large language models (LLMs) have achieved remarkable success in diverse tasks, yet their safety alignment remains fragile during adaptation. Even when fine-tuning on benign data or with low-rank adaptation, pre-trained safety behaviors are easily degraded, leading to harmful responses in the fine-tuned models. To address this challenge, we propose GuardSpace, a guardrail framework for preserving safety alignment throughout fine-tuning, composed of two key components: a safety-sensitive subspace and a harmful-resistant null space. First, we explicitly decompose pre-trained weights into safety-relevant and safety-irrelevant components using covariance-preconditioned singular value decomposition, and initialize low-rank adapters from the safety-irrelevant ones, while freezing safety-relevant components to preserve their associated safety mechanism. Second, we construct a null space projector that restricts adapter updates from altering safe outputs on harmful prompts, thereby maintaining the original refusal behavior. Experiments with various pre-trained models on multiple downstream tasks demonstrate that GuardSpace achieves superior performance over existing methods. Notably, for Llama-2-7B-Chat fine-tuned on GSM8K, GuardSpace outperforms the state-of-the-art method AsFT, reducing the average harmful score from 14.4% to 3.6%, while improving the accuracy from 26.0% to 28.0%.
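The first component, the covariance-preconditioned SVD split, can be illustrated with a minimal NumPy sketch. This is an assumption-laden reconstruction, not the paper's implementation: it assumes the decomposition takes the form of an SVD of the weight matrix preconditioned by an activation covariance, with the top singular directions treated as safety-relevant (to be frozen) and the remainder as safety-irrelevant (used to initialize adapters). The function name `guard_decompose` and all variable names are illustrative.

```python
import numpy as np

def guard_decompose(W, X_safety, r_safe):
    """Sketch of a covariance-preconditioned SVD split of a weight matrix.

    W        : (d_out, d_in) pre-trained weight
    X_safety : (n, d_in) activations collected on safety-relevant prompts
    r_safe   : number of top singular directions treated as safety-relevant

    Returns (W_safe, W_rest) with W = W_safe + W_rest exactly.
    (Illustrative form; the paper's exact preconditioning may differ.)
    """
    # Activation covariance acts as the preconditioner on the input side.
    C = X_safety.T @ X_safety / len(X_safety)
    # SVD of the preconditioned weight: directions are ranked by how much
    # they matter for outputs on the safety-relevant input distribution.
    U, S, Vt = np.linalg.svd(W @ C, full_matrices=False)
    # Safety-relevant part: top-r_safe components, with the preconditioner
    # undone on the right so the pieces live in the original weight space.
    C_inv = np.linalg.pinv(C)
    W_safe = U[:, :r_safe] @ np.diag(S[:r_safe]) @ Vt[:r_safe] @ C_inv
    W_rest = W - W_safe
    return W_safe, W_rest
```

In a GuardSpace-style setup, `W_safe` would stay frozen while low-rank adapter factors are initialized from `W_rest`.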


Key Contributions

  • Covariance-preconditioned SVD decomposition of pre-trained weights into safety-relevant and safety-irrelevant components, with LoRA adapters initialized from safety-irrelevant components and safety-relevant components frozen
  • Null space projector that constrains adapter updates to avoid altering model outputs on harmful prompts, preserving refusal behavior during fine-tuning
  • GuardSpace reduces the harmful response rate for Llama-2-7B-Chat fine-tuned on GSM8K from 14.4% (SOTA AsFT) to 3.6%, while simultaneously improving task accuracy
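The null space projector described above admits a short sketch. This is a generic SVD-based construction under the assumption that the constraint is enforced on the adapter update's input side: any update multiplied by the projector leaves outputs on the collected harmful-prompt activations unchanged. The function name `null_space_projector` is illustrative, not from the paper.

```python
import numpy as np

def null_space_projector(H, tol=1e-10):
    """Projector onto the null space of harmful-prompt activations.

    H : (n, d_in) activations on harmful prompts.
    Returns P (d_in, d_in) with H @ P ~ 0, so for any update dW,
    (W + dW @ P) @ h == W @ h for every h in the row space of H,
    i.e. refusal outputs on these prompts are preserved.
    """
    _, S, Vt = np.linalg.svd(H, full_matrices=True)
    # Numerical rank of H: singular values above a relative tolerance.
    rank = int((S > tol * S.max()).sum()) if S.size else 0
    V_row = Vt[:rank]  # orthonormal basis of H's row space
    # Subtract the projection onto the row space to get the null-space part.
    P = np.eye(H.shape[1]) - V_row.T @ V_row
    return P
```

During fine-tuning, each adapter step `dW` would be applied as `dW @ P`, constraining the update to directions that cannot alter the model's behavior on the stored harmful activations.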

🛡️ Threat Analysis

Transfer Learning Attack

GuardSpace directly defends against the transfer learning vulnerability in which fine-tuning (including benign LoRA adaptation) degrades pre-trained safety alignment, a classic attack exploiting the gap between pre-training and fine-tuning distributions.


Details

Domains: nlp
Model Types: llm, transformer
Threat Tags: training_time, white_box
Datasets: GSM8K
Applications: llm fine-tuning, safety alignment preservation, low-rank adaptation