
Safeguarding LLM Fine-tuning via Push-Pull Distributional Alignment

Haozhong Wang, Zhuo Li, Yibo Yang, He Zhao, Hongyuan Zha, Dandan Guo

0 citations · 61 references · arXiv

Published on arXiv (2601.07200)

Transfer Learning Attack

OWASP ML Top 10 — ML07

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

SOT achieves a superior safety-utility trade-off over heuristic and optimization-based baselines by purifying fine-tuning data via distributional alignment rather than instance-level filtering.

Safety Optimal Transport (SOT)

Novel technique introduced


The inherent safety alignment of Large Language Models (LLMs) is prone to erosion during fine-tuning, even when using seemingly innocuous datasets. While existing defenses attempt to mitigate this via data selection, they typically rely on heuristic, instance-level assessments that neglect the global geometry of the data distribution and fail to explicitly repel harmful patterns. To address this, we introduce Safety Optimal Transport (SOT), a novel framework that reframes safe fine-tuning from an instance-level filtering challenge to a distribution-level alignment task grounded in Optimal Transport (OT). At its core is a dual-reference "push-pull" weight-learning mechanism: SOT optimizes sample importance by actively pulling the downstream distribution towards a trusted safe anchor while simultaneously pushing it away from a general harmful reference. This establishes a robust geometric safety boundary that effectively purifies the training data. Extensive experiments across diverse model families and domains demonstrate that SOT significantly enhances model safety while maintaining competitive downstream performance, achieving a superior safety-utility trade-off compared to baselines.


Key Contributions

  • Safety Optimal Transport (SOT) framework that reframes safe fine-tuning as a distribution-level alignment task using Optimal Transport theory rather than instance-level heuristic filtering
  • Dual-reference push-pull weight-learning mechanism that simultaneously pulls the fine-tuning distribution toward a trusted safe anchor and pushes it away from a harmful reference distribution
  • Demonstrates superior safety-utility trade-off compared to existing baselines (SAFT, SEAL, etc.) across diverse LLM families and downstream domains
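The dual-reference push-pull idea can be illustrated with a toy reweighting sketch: compute entropy-regularized OT plans from the fine-tuning samples to a safe anchor set and to a harmful reference set, then up-weight samples that are cheap to transport to the safe anchor and expensive to transport to the harmful reference. Everything below — the function names `sinkhorn_plan` and `push_pull_weights`, the squared-Euclidean cost on embeddings, the Sinkhorn solver, and the exponential weighting — is an illustrative assumption, not the paper's actual objective or implementation.

```python
import numpy as np

def sinkhorn_plan(C, reg=0.05, iters=200):
    """Entropy-regularized OT plan between uniform marginals (Sinkhorn iterations)."""
    n, m = C.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-C / reg)                       # Gibbs kernel of the cost matrix
    u = np.ones(n)
    for _ in range(iters):                     # alternating marginal projections
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]         # plan: row i sums to 1/n

def push_pull_weights(X, S, H, tau=1.0):
    """Toy push-pull sample weights over embeddings X: favor samples that are
    cheap to transport to the safe anchor S and expensive to transport to the
    harmful reference H (hypothetical proxy for SOT's weight learning)."""
    C_safe = np.linalg.norm(X[:, None] - S[None, :], axis=-1) ** 2
    C_harm = np.linalg.norm(X[:, None] - H[None, :], axis=-1) ** 2
    scale = max(C_safe.max(), C_harm.max())    # shared scale keeps Sinkhorn stable
    C_safe, C_harm = C_safe / scale, C_harm / scale
    P_safe, P_harm = sinkhorn_plan(C_safe), sinkhorn_plan(C_harm)
    n = X.shape[0]
    cost_safe = (P_safe * C_safe).sum(axis=1) * n   # per-sample transport cost
    cost_harm = (P_harm * C_harm).sum(axis=1) * n
    score = cost_harm - cost_safe              # push away from harm, pull toward safe
    w = np.exp((score - score.max()) / tau)    # stabilized exponential weighting
    return w / w.sum()                         # normalized importance weights
```

Because the weights are derived from full transport plans rather than per-sample nearest-neighbor distances, the score for each sample depends on where the whole batch sits relative to both references, which is the distribution-level flavor the paper contrasts with instance-level filtering.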

🛡️ Threat Analysis

Transfer Learning Attack

The paper explicitly addresses fine-tuning-stage safety erosion — a transfer learning vulnerability in which the fine-tuning process degrades pre-trained safety alignment (RLHF/DPO), even on benign data. SOT defends against attacks that exploit the distribution gap between pre-training and fine-tuning data.


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
training_time
Applications
llm fine-tuning · conversational ai safety · instruction tuning