
Adaptive Defense against Harmful Fine-Tuning for Large Language Models via Bayesian Data Scheduler

Zixuan Hu 1, Li Shen 2, Zhenyi Wang 3, Yongxian Wei 4, Dacheng Tao 1

5 citations · 1 influential · 82 references · arXiv


Published on arXiv (2510.27172)

Data Poisoning Attack

OWASP ML Top 10 — ML02

Transfer Learning Attack

OWASP ML Top 10 — ML07

Training Data Poisoning

OWASP LLM Top 10 — LLM03

Key Finding

BDS achieves state-of-the-art defense against harmful fine-tuning across diverse attack and defense settings without relying on attack simulation.

Bayesian Data Scheduler (BDS)

Novel technique introduced


Harmful fine-tuning poses critical safety risks to fine-tuning-as-a-service for large language models. Existing defense strategies preemptively build robustness via attack simulation but suffer from fundamental limitations: (i) the infeasibility of extending attack simulations beyond bounded threat models due to the inherent difficulty of anticipating unknown attacks, and (ii) limited adaptability to varying attack settings, as simulation fails to capture their variability and complexity. To address these challenges, we propose Bayesian Data Scheduler (BDS), an adaptive tuning-stage defense strategy with no need for attack simulation. BDS formulates harmful fine-tuning defense as a Bayesian inference problem, learning the posterior distribution of each data point's safety attribute, conditioned on the fine-tuning and alignment datasets. The fine-tuning process is then constrained by weighting data with their safety attributes sampled from the posterior, thus mitigating the influence of harmful data. By leveraging the post hoc nature of Bayesian inference, the posterior is conditioned on the fine-tuning dataset, enabling BDS to tailor its defense to the specific dataset, thereby achieving adaptive defense. Furthermore, we introduce a neural scheduler based on amortized Bayesian learning, enabling efficient transfer to new data without retraining. Comprehensive results across diverse attack and defense settings demonstrate the state-of-the-art performance of our approach. Code is available at https://github.com/Egg-Hu/Bayesian-Data-Scheduler.
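The weighting mechanism described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the Beta posterior, the function name, and the per-example sampling scheme below are illustrative assumptions; in BDS the posterior over each data point's safety attribute is learned jointly from the fine-tuning and alignment datasets.

```python
import numpy as np

def weighted_finetune_loss(per_example_losses, alpha, beta, rng):
    """Sketch of BDS-style data scheduling (hypothetical simplification).

    Each data point i carries a Beta(alpha_i, beta_i) posterior over its
    safety attribute in [0, 1]. We sample one weight per example from
    that posterior and compute the weighted mean fine-tuning loss, so
    points whose posterior mass sits near 0 (likely harmful) contribute
    almost no gradient signal to the fine-tuned model.
    """
    weights = rng.beta(alpha, beta)  # sampled safety attributes, one per example
    return float(np.average(per_example_losses, weights=weights)), weights
```

In a real training loop the sampled weights would multiply per-example losses inside each optimizer step; the sketch only shows the scheduling effect of the posterior.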


Key Contributions

  • Bayesian Data Scheduler (BDS) that formulates harmful fine-tuning defense as Bayesian inference over per-datapoint safety attributes, conditioned on both the fine-tuning and alignment datasets.
  • Adaptive, simulation-free defense that tailors itself to the specific fine-tuning dataset without requiring a predefined attack model, overcoming the bounded-threat-model limitation of prior approaches.
  • Neural scheduler using amortized Bayesian learning for efficient transfer to new data distributions without retraining.
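The third contribution, the amortized neural scheduler, can be sketched as a small network that maps per-example features to safety scores, so new data is scored with a forward pass instead of per-dataset inference. The single-layer logistic parameterization and all names below are hypothetical, not the paper's architecture.

```python
import numpy as np

def amortized_scheduler(features, W, b):
    """Hypothetical amortized scheduler (not the paper's architecture):
    a linear layer plus sigmoid maps per-example features to an
    estimated safety score in (0, 1). Because the mapping is a learned
    function of features rather than per-dataset latent variables, new
    fine-tuning data can be scored in one forward pass, no retraining.
    """
    logits = features @ W + b            # (n, d) @ (d,) -> (n,)
    return 1.0 / (1.0 + np.exp(-logits)) # per-example safety score
```

The amortization trade-off is the usual one: a shared scoring function generalizes across datasets at the cost of exact per-dataset posterior inference.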

🛡️ Threat Analysis

Data Poisoning Attack

The threat is adversarial poisoning of the fine-tuning dataset to degrade LLM safety alignment; BDS defends by learning a posterior over each data point's safety attribute to down-weight harmful data during training.

Transfer Learning Attack

The attack exploits the fine-tuning (transfer-learning) stage of pre-trained, safety-aligned LLMs: harmful data submitted to a fine-tuning-as-a-service provider undoes the model's safety alignment, making this a transfer-learning attack vector.


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
training_time · grey_box
Applications
fine-tuning-as-a-service · llm safety alignment