Defense · 2026

Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning

Jyotin Goel, Souvik Maji, Pratik Mazumder

0 citations · 57 references · arXiv (Cornell University)


Published on arXiv · 2602.17546

Transfer Learning Attack

OWASP ML Top 10 — ML07

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Adaptive regularization consistently lowers attack success rate across multiple model families and attack scenarios compared to standard fine-tuning while preserving downstream task performance and adding no inference-time cost.

Adaptive Safety Regularization

Novel technique introduced


Instruction-following language models are trained to be helpful and safe, yet their safety behavior can deteriorate under benign fine-tuning and worsen under adversarial updates. Existing defenses often offer limited protection or force a trade-off between safety and utility. We introduce a training framework that adapts regularization in response to safety risk, enabling models to remain aligned throughout fine-tuning. To estimate safety risk at training time, we explore two distinct approaches: a judge-based Safety Critic that assigns high-level harm scores to training batches, and an activation-based risk predictor built with a lightweight classifier trained on intermediate model activations to estimate harmful intent. Each approach provides a risk signal that is used to constrain updates deemed higher risk to remain close to a safe reference policy, while lower-risk updates proceed with standard training. We empirically verify that harmful intent signals are predictable from pre-generation activations and that judge scores provide effective high-recall safety guidance. Across multiple model families and attack scenarios, adaptive regularization with either risk estimation approach consistently lowers attack success rate compared to standard fine-tuning, preserves downstream performance, and adds no inference-time cost. This work demonstrates a principled mechanism for maintaining safety without sacrificing utility.
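The core mechanism described above can be sketched in a few lines: a per-batch risk score scales a KL penalty that pulls the fine-tuned policy back toward a safe reference policy. This is a minimal illustrative sketch, not the paper's implementation; the linear penalty ramp, the `lam_max` value, and the function names are all assumptions.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions (illustrative stand-in
    for a token-level KL between policy and reference model)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def adaptive_penalty(risk, lam_min=0.0, lam_max=10.0):
    """Map a batch risk score in [0, 1] to a KL penalty weight.
    The linear ramp and lam_max=10.0 are illustrative choices;
    the paper's exact schedule is not reproduced here."""
    return lam_min + (lam_max - lam_min) * float(np.clip(risk, 0.0, 1.0))

def regularized_loss(task_loss, policy_probs, ref_probs, risk):
    """Task loss plus a risk-scaled KL term anchoring the update
    to the safe reference policy."""
    return task_loss + adaptive_penalty(risk) * kl_divergence(policy_probs, ref_probs)
```

A low-risk batch gets a penalty weight near zero, so training proceeds essentially unconstrained; a high-risk batch pays a large KL cost for drifting from the reference policy, which is what preserves alignment without a fixed utility-safety trade-off.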


Key Contributions

  • Adaptive regularization framework that dynamically adjusts KL-divergence penalty strength based on estimated safety risk of each training batch, avoiding the utility-safety trade-off of fixed penalties
  • Two complementary risk estimation approaches: a judge-based Safety Critic scoring training batch harmfulness at a semantic level, and a lightweight activation-based classifier predicting harmful intent from intermediate model representations before generation
  • Empirical demonstration that harmful intent is predictable from pre-generation activations, so safety is enforced entirely at training time and adds no inference-time overhead
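The second contribution, an activation-based risk predictor, can be approximated as a simple linear probe fit on pooled intermediate activations. The sketch below uses logistic regression on synthetic activation vectors; the actual layer choice, pooling, and classifier architecture in the paper are not reproduced, and all names here are hypothetical.

```python
import numpy as np

def train_probe(acts, labels, lr=0.5, steps=500):
    """Fit a logistic-regression probe on pooled activations.
    Stand-in for the paper's lightweight activation-based classifier;
    gradient descent on cross-entropy, no regularization."""
    n, d = acts.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        z = acts @ w + b
        p = 1.0 / (1.0 + np.exp(-z))      # predicted harm probability
        grad = p - labels                  # dL/dz for cross-entropy
        w -= lr * acts.T @ grad / n
        b -= lr * grad.mean()
    return w, b

def risk_score(w, b, act):
    """Predicted probability that an example carries harmful intent,
    usable as the risk signal that modulates the KL penalty."""
    return float(1.0 / (1.0 + np.exp(-(act @ w + b))))
```

Because the probe reads activations computed before generation, the risk signal is available at training time for free, which is what lets the framework avoid any inference-time safety machinery.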

🛡️ Threat Analysis

Transfer Learning Attack

The paper directly defends against harmful fine-tuning attacks — adversarial or benign updates applied during the transfer learning (fine-tuning) stage that exploit the gap between pre-training alignment and post-fine-tuning behavior to strip safety guardrails. Adaptive regularization is specifically designed to constrain safety-compromising updates at fine-tuning time.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, white_box, targeted
Applications
llm fine-tuning safety, safety alignment preservation