defense 2026

GR-SAP: Generative Replay for Safety Alignment Preservation during Fine-Tuning

Zhouxiang Fang 1, Jiawei Zhou 2, Hanjie Chen 1


Published on arXiv: 2603.10243

Transfer Learning Attack

OWASP ML Top 10 — ML07

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

GR-SAP reduces LLaMA's harmful response ratio from 6.28% to 0.58% versus the unmixed fine-tuning baseline while maintaining comparable downstream task performance.

GR-SAP

Novel technique introduced


Recent studies show that the safety alignment of large language models (LLMs) can be easily compromised even by seemingly non-adversarial fine-tuning. To preserve safety alignment during fine-tuning, a widely used strategy is to jointly optimize safety and task objectives by mixing in the original alignment data, which, however, is typically inaccessible even for open-weight LLMs. Inspired by generative replay in continual learning, we propose Generative Replay for Safety Alignment Preservation (GR-SAP), a unified framework that synthesizes domain-specific alignment data from LLMs and integrates it during downstream adaptation to preserve safety alignment. Theoretical and empirical analyses demonstrate that this synthetic data serves as a reliable proxy for the original alignment data. Experiments across various models and downstream tasks show that GR-SAP substantially mitigates fine-tuning-induced safety degradation while maintaining comparable downstream performance. Our code is available at https://github.com/chili-lab/gr-sap.


Key Contributions

  • A tailored extraction strategy to synthesize domain-specific safety alignment data from LLMs, validated theoretically and empirically as a reliable proxy for original alignment data.
  • GR-SAP framework that mixes self-synthesized safety data with task-specific data during fine-tuning to preserve safety alignment without requiring access to the original alignment dataset.
  • Comprehensive evaluation across four model families and five downstream benchmarks showing substantial reduction in harmful response ratios (e.g., LLaMA from 6.28% to 0.58%) while maintaining downstream performance.
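The mixing step behind the second contribution can be sketched in a few lines. The snippet below is an illustrative assumption, not the paper's implementation: the function names, the `safety_ratio` hyperparameter, and the stand-in synthesis step are all hypothetical, standing in for the model-generated replay data GR-SAP would produce.

```python
import random

def synthesize_safety_examples(prompts, respond_safely):
    """Stand-in for generative replay: in GR-SAP the aligned model itself
    would generate safe responses to domain-specific harmful prompts."""
    return [{"prompt": p, "response": respond_safely(p)} for p in prompts]

def build_mixed_dataset(task_data, safety_data, safety_ratio=0.1, seed=0):
    """Mix self-synthesized safety examples into the task data so that
    fine-tuning jointly optimizes the task and safety objectives.
    `safety_ratio` (illustrative) controls how many replay examples
    are added relative to the task set size."""
    rng = random.Random(seed)
    n_safety = max(1, int(len(task_data) * safety_ratio))
    mixed = task_data + rng.choices(safety_data, k=n_safety)
    rng.shuffle(mixed)
    return mixed

# Toy usage: 20 task examples plus a 10% replay mix of safety examples.
task = [{"prompt": f"q{i}", "response": f"a{i}"} for i in range(20)]
safety = synthesize_safety_examples(
    ["harmful prompt 1", "harmful prompt 2"],
    lambda p: "I can't help with that.",
)
mixed = build_mixed_dataset(task, safety, safety_ratio=0.1)
```

The resulting `mixed` list would then be fed to a standard fine-tuning loop in place of the task data alone; no access to the original alignment dataset is needed.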

🛡️ Threat Analysis

Transfer Learning Attack

The core threat modeled is that the transfer-learning / fine-tuning process, whether inadvertently or intentionally, erodes safety alignment. GR-SAP is a defense designed specifically to counter this fine-tuning-induced safety degradation, making it a direct ML07 mitigation.


Details

Domains
NLP
Model Types
LLM, transformer
Threat Tags
training_time
Datasets
AdvBench, HarmBench
Applications
large language model fine-tuning, domain adaptation, instruction tuning