defense 2026

GR-SAP: Generative Replay for Safety Alignment Preservation during Fine-Tuning

Zhouxiang Fang 1, Jiawei Zhou 2, Hanjie Chen 1


Published on arXiv: 2603.10243

Transfer Learning Attack

OWASP ML Top 10 — ML07

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

GR-SAP reduces LLaMA's harmful response ratio from 6.28% to 0.58% versus the unmixed fine-tuning baseline while maintaining comparable downstream task performance.

GR-SAP

Novel technique introduced


Recent studies show that the safety alignment of large language models (LLMs) can be easily compromised even by seemingly non-adversarial fine-tuning. To preserve safety alignment during fine-tuning, a widely used strategy is to jointly optimize safety and task objectives by mixing in the original alignment data, which, however, is typically inaccessible even for open-weight LLMs. Inspired by generative replay in continual learning, we propose Generative Replay for Safety Alignment Preservation (GR-SAP), a unified framework that synthesizes domain-specific alignment data from LLMs and integrates it during downstream adaptation to preserve safety alignment. Theoretical and empirical analyses demonstrate that this synthetic data serves as a reliable proxy for the original alignment data. Experiments across various models and downstream tasks show that GR-SAP substantially mitigates fine-tuning-induced safety degradation while maintaining comparable downstream performance. Our code is available at https://github.com/chili-lab/gr-sap.


Key Contributions

  • A tailored extraction strategy to synthesize domain-specific safety alignment data from LLMs, validated theoretically and empirically as a reliable proxy for original alignment data.
  • GR-SAP framework that mixes self-synthesized safety data with task-specific data during fine-tuning to preserve safety alignment without requiring access to the original alignment dataset.
  • Comprehensive evaluation across four model families and five downstream benchmarks showing substantial reduction in harmful response ratios (e.g., LLaMA from 6.28% to 0.58%) while maintaining downstream performance.
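The mixing step behind the second contribution can be sketched in a few lines. The snippet below is an illustrative assumption, not the paper's implementation: the function names, the `safety_ratio` hyperparameter, and the stand-in synthesis step are all hypothetical, standing in for the model-generated replay data GR-SAP would produce.

```python
import random

def synthesize_safety_examples(prompts, respond_safely):
    """Stand-in for generative replay: in GR-SAP the aligned model itself
    would generate safe responses to domain-specific harmful prompts."""
    return [{"prompt": p, "response": respond_safely(p)} for p in prompts]

def build_mixed_dataset(task_data, safety_data, safety_ratio=0.1, seed=0):
    """Mix self-synthesized safety examples into the task data so that
    fine-tuning jointly optimizes the task and safety objectives.
    `safety_ratio` (illustrative) controls how many replay examples
    are added relative to the task set size."""
    rng = random.Random(seed)
    n_safety = max(1, int(len(task_data) * safety_ratio))
    mixed = task_data + rng.choices(safety_data, k=n_safety)
    rng.shuffle(mixed)
    return mixed

# Toy usage: 20 task examples plus a 10% replay mix of safety examples.
task = [{"prompt": f"q{i}", "response": f"a{i}"} for i in range(20)]
safety = synthesize_safety_examples(
    ["harmful prompt 1", "harmful prompt 2"],
    lambda p: "I can't help with that.",
)
mixed = build_mixed_dataset(task, safety, safety_ratio=0.1)
```

The resulting `mixed` list would then be fed to a standard fine-tuning loop in place of the task data alone; no access to the original alignment dataset is needed.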

🛡️ Threat Analysis

Transfer Learning Attack

The core threat modeled is that the transfer-learning / fine-tuning process, whether inadvertently or intentionally, erodes safety alignment. GR-SAP is a defense designed specifically to counter this fine-tuning-induced safety degradation, making it a direct ML07 mitigation.


Details

Domains
NLP
Model Types
LLM, transformer
Threat Tags
training_time
Datasets
AdvBench, HarmBench
Applications
large language model fine-tuning, domain adaptation, instruction tuning