defense 2025

Pharmacist: Safety Alignment Data Curation for Large Language Models against Harmful Fine-tuning

2 citations · 1 influential · 40 references · arXiv

Published on arXiv

2510.10085

Transfer Learning Attack

OWASP ML Top 10 — ML07

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Pharmacist reduces alignment training time by ~57% while improving defense performance by up to 3.30% and inference performance by up to 3.50% when combined with RepNoise and T-Vaccine defenses.

Pharmacist

Novel technique introduced

Harmful fine-tuning poses significant safety challenges for fine-tuning-as-a-service in large language models. Existing alignment-stage defenses, e.g., Vaccine, RepNoise, Booster, and T-Vaccine, mitigate harmful fine-tuning by enhancing the model's robustness during the alignment phase. However, these methods often overlook a critical upstream factor: the role of the original safety-alignment data. We observe that their defense performance and computational efficiency remain constrained by the quality and composition of the alignment dataset. To address this limitation, we propose Pharmacist, a safety alignment data curation solution that enhances defense against harmful fine-tuning by selecting a high-quality and safety-critical core subset from the original alignment data. The core idea of Pharmacist is to train an alignment data selector to rank alignment data: it up-ranks high-quality and safety-critical alignment data and down-ranks low-quality and non-safety-critical data. Empirical results indicate that models trained on datasets selected by Pharmacist outperform those trained on datasets selected by existing selection methods in both defense and inference performance. In addition, Pharmacist can be effectively integrated with mainstream alignment-stage defense methods. For example, when applied to RepNoise and T-Vaccine, using the dataset selected by Pharmacist instead of the full dataset improves defense performance by 2.60% and 3.30%, respectively, and inference performance by 3.50% and 1.10%. Notably, it reduces training time by 56.83% and 57.63%, respectively. Our code is available at https://github.com/Lslland/Pharmacist.
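The rank-then-select step described in the abstract can be illustrated with a minimal sketch. It covers only the final selection, assuming each alignment sample already has a score from some trained selector. The function name `select_core_subset`, the `score_fn` argument, and the `keep_fraction` parameter are hypothetical and are not taken from the Pharmacist code base. How the selector itself is trained is not shown here.

```python
import math
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def select_core_subset(
    samples: Sequence[T],
    score_fn: Callable[[T], float],
    keep_fraction: float,
) -> List[T]:
    """Return the highest-scoring fraction of alignment samples.

    score_fn is a stand-in for a trained data selector (an assumption):
    a higher score means the sample is judged higher quality and more
    safety-critical.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not samples:
        return []
    # Up-rank high-scoring samples, down-rank the rest.
    ranked = sorted(samples, key=score_fn, reverse=True)
    # Keep at least one sample so the subset is never empty.
    k = max(1, math.ceil(len(ranked) * keep_fraction))
    return ranked[:k]


# Toy usage with made-up (text, score) pairs; the scores are illustrative only.
toy_data = [
    ("refusal example A", 0.1),
    ("refusal example B", 0.9),
    ("refusal example C", 0.5),
    ("refusal example D", 0.7),
]
core = select_core_subset(toy_data, score_fn=lambda s: s[1], keep_fraction=0.5)
```

An alignment-stage defense would then be trained on `core` in place of the full dataset, which is where the reported training-time savings would come from.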

Key Contributions

Proposes Pharmacist, a data selector that ranks and curates safety-critical, high-quality subsets from alignment datasets to enhance defense against harmful fine-tuning
Demonstrates that Pharmacist integrates with existing alignment-stage defenses (RepNoise, T-Vaccine), improving defense performance by 2.60–3.30% and inference performance by 1.10–3.50%
Reduces alignment training time by approximately 57% by discarding low-quality, non-safety-critical alignment samples without sacrificing defense performance

🛡️ Threat Analysis

Transfer Learning Attack

Harmful fine-tuning is a transfer learning attack that exploits the fine-tuning process to override safety alignment. The paper defends against this by improving the upstream alignment data quality so the model remains robust when subsequently fine-tuned on adversarial data.

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

training_time, black_box

Applications

large language models, fine-tuning-as-a-service, safety alignment

Similar Papers

Rethinking Safety in LLM Fine-tuning: An Optimization Perspective

In-Training Defenses against Emergent Misalignment in Language Models

Unified Defense for Large Language Models against Jailbreak and Fine-Tuning Attacks in Education

Token-level Data Selection for Safe LLM Fine-tuning

Few Tokens, Big Leverage: Preserving Safety Alignment by Constraining Safety Tokens during Fine-tuning

LSSF: Safety Alignment for Large Language Models through Low-Rank Safety Subspace Fusion

SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation

GR-SAP: Generative Replay for Safety Alignment Preservation during Fine-Tuning