defense 2025

Unforgotten Safety: Preserving Safety Alignment of Large Language Models with Continual Learning

Lama Alssum ¹, Hani Itani ², Hasan Abed Al Kader Hammoud ¹, Philip Torr ¹, Adel Bibi ², Bernard Ghanem ¹

¹ King Abdullah University of Science and Technology

² University of Oxford

2 citations · 50 references · arXiv

Published on arXiv

2512.10150

Transfer Learning Attack

OWASP ML Top 10 — ML07

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

DER (continual learning) consistently achieves lower attack success rates than standard fine-tuning and outperforms existing safety-preserving baselines across LLaMA2-7B, Mistral-7B, and Gemma-2B under both benign and poisoned fine-tuning conditions.

DER

Novel technique introduced

The safety alignment of large language models (LLMs) is becoming increasingly important with their democratization. In this paper, we study the safety degradation that comes with adapting LLMs to new tasks. We attribute this safety compromise to catastrophic forgetting and frame the problem of preserving safety when fine-tuning as a continual learning (CL) problem. We consider the fine-tuning-as-a-service setup where the user uploads their data to a service provider to get a customized model that excels on the user's selected task. We adapt several CL approaches from the literature and systematically evaluate their ability to mitigate safety degradation. These include regularization-based, memory-based, and model merging approaches. We consider two scenarios, (1) benign user data and (2) poisoned user data. Our results demonstrate that CL approaches consistently achieve lower attack success rates than standard fine-tuning. Among these, DER outperforms both other CL methods and existing safety-preserving baselines while maintaining task utility. These findings generalize across three downstream tasks (GSM8K, SST2, Code) and three model families (LLaMA2-7B, Mistral-7B, Gemma-2B), establishing CL as a practical solution to preserve safety.
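The attack success rate mentioned above can be read as the fraction of harmful prompts for which the model produces a harmful answer instead of refusing. The sketch below is only an illustration of that ratio: the function names and the keyword-based `toy_refusal_judge` are hypothetical stand-ins, and the paper's actual judging procedure is not described here.

```python
from typing import Callable, Sequence


def attack_success_rate(
    responses: Sequence[str],
    is_harmful: Callable[[str], bool],
) -> float:
    """Fraction of responses (to harmful prompts) that a judge flags as harmful."""
    if not responses:
        raise ValueError("need at least one response")
    harmful = sum(1 for r in responses if is_harmful(r))
    return harmful / len(responses)


def toy_refusal_judge(response: str) -> bool:
    """Hypothetical judge: treats any response without a refusal phrase as harmful.

    This is an assumption made for illustration; a real evaluation would use a
    stronger judge than keyword matching.
    """
    refusal_markers = ("i cannot", "i can't", "i'm sorry")
    text = response.lower()
    return not any(marker in text for marker in refusal_markers)


responses = [
    "I cannot help with that.",
    "Sure, here is how...",
    "I'm sorry, but no.",
    "I can't do that.",
]
print(attack_success_rate(responses, toy_refusal_judge))
```

A lower value after fine-tuning means more of the original refusal behavior was retained.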

Key Contributions

Frames LLM safety preservation during fine-tuning as a continual learning problem, connecting catastrophic forgetting of safety alignment to CL literature
Systematically evaluates regularization-based, memory-based, and model merging CL approaches under both benign and poisoned fine-tuning scenarios in a fine-tuning-as-a-service setup
Demonstrates that DER (Dark Experience Replay) outperforms existing safety-preserving baselines while maintaining task utility across three LLM families and three downstream tasks
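As a rough illustration of the memory-based idea behind DER, the sketch below follows the general Dark Experience Replay recipe from the continual learning literature: a task loss on new data plus a penalty that keeps the model's current logits on buffered examples close to the logits stored for them earlier. Treating the buffer as safety-alignment examples with logits recorded from the aligned model is an assumption about how this would apply to LLM fine-tuning. The names `der_loss` and `alpha`, and the use of plain Python lists instead of tensors, are illustrative choices, not the paper's implementation.

```python
import math
from typing import Sequence


def log_softmax(logits: Sequence[float]) -> list:
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Negative log-likelihood of the correct class."""
    return -log_softmax(logits)[label]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two logit vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def der_loss(
    task_logits: Sequence[Sequence[float]],
    task_labels: Sequence[int],
    replay_logits_now: Sequence[Sequence[float]],
    replay_logits_stored: Sequence[Sequence[float]],
    alpha: float = 0.5,  # hypothetical weight on the replay term
) -> float:
    """Task cross-entropy plus alpha times the logit-matching penalty on the buffer."""
    task_term = sum(
        cross_entropy(lg, y) for lg, y in zip(task_logits, task_labels)
    ) / len(task_labels)
    replay_term = sum(
        mse(now, old) for now, old in zip(replay_logits_now, replay_logits_stored)
    ) / len(replay_logits_stored)
    return task_term + alpha * replay_term


# One new-task example and one buffered example whose logits have drifted.
loss = der_loss(
    task_logits=[[0.0, 0.0]],
    task_labels=[0],
    replay_logits_now=[[1.0, 0.0]],
    replay_logits_stored=[[0.0, 0.0]],
)
print(loss)
```

When the current logits on the buffer match the stored ones, the replay term vanishes and only the task loss remains, so the penalty only acts when fine-tuning starts to move the model away from its earlier behavior on the buffered data.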

🛡️ Threat Analysis

Transfer Learning Attack

The paper directly studies how fine-tuning (transfer learning) degrades safety alignment in LLMs, that is, the gap between the aligned model's safety before fine-tuning and its behavior afterward, and proposes defenses (CL approaches) to counter this degradation. Both benign and adversarially poisoned fine-tuning are evaluated as attack vectors.

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

training_time, grey_box

Datasets

GSM8K, SST2, Code

Applications

fine-tuning-as-a-service, large language model fine-tuning

Similar Papers

Anchoring Refusal Direction: Mitigating Safety Risks in Tuning via Projection Constraint

SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation

Few Tokens, Big Leverage: Preserving Safety Alignment by Constraining Safety Tokens during Fine-tuning

GR-SAP: Generative Replay for Safety Alignment Preservation during Fine-Tuning

Safeguarding LLM Fine-tuning via Push-Pull Distributional Alignment

Token-level Data Selection for Safe LLM Fine-tuning

LSSF: Safety Alignment for Large Language Models through Low-Rank Safety Subspace Fusion

A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space