
Pharmacist: Safety Alignment Data Curation for Large Language Models against Harmful Fine-tuning

Guozhi Liu 1, Qi Mu 1,2, Tiansheng Huang 3, Xinhua Wang 1, Li Shen 4, Weiwei Lin 1,5, Zhang Li 6


Published on arXiv: 2510.10085

Transfer Learning Attack (OWASP ML Top 10 — ML07)

Prompt Injection (OWASP LLM Top 10 — LLM01)

Key Finding

Pharmacist reduces alignment training time by ~57% while improving defense performance by up to 3.30% and inference performance by up to 3.50% when combined with RepNoise and T-Vaccine defenses.

Pharmacist

Novel technique introduced


Harmful fine-tuning poses significant safety challenges for fine-tuning-as-a-service in large language models. Existing alignment-stage defenses, e.g., Vaccine, RepNoise, Booster, and T-Vaccine, mitigate harmful fine-tuning by enhancing the model's robustness during the alignment phase. While effective, these methods often overlook a critical upstream factor: the role of the original safety-alignment data. We observe that their defense performance and computational efficiency remain constrained by the quality and composition of the alignment dataset. To address this limitation, we propose Pharmacist, a safety alignment data curation solution that enhances defense against harmful fine-tuning by selecting a high-quality and safety-critical core subset from the original alignment data. The core idea of Pharmacist is to train an alignment data selector that ranks alignment data: it up-ranks high-quality, safety-critical examples and down-ranks low-quality, non-safety-critical ones. Empirical results indicate that models trained on datasets selected by Pharmacist outperform those trained on datasets selected by existing selection methods in both defense and inference performance. In addition, Pharmacist can be effectively integrated with mainstream alignment-stage defense methods. For example, when applied to RepNoise and T-Vaccine, using the dataset selected by Pharmacist instead of the full dataset improves defense performance by 2.60% and 3.30%, respectively, and enhances inference performance by 3.50% and 1.10%. Notably, it reduces training time by 56.83% and 57.63%, respectively. Our code is available at https://github.com/Lslland/Pharmacist.
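
The rank-and-select idea from the abstract can be illustrated with a small sketch. This is a hedged illustration, not the authors' implementation: the selector architecture, the use of precomputed feature embeddings, and the keep ratio are all assumptions; only the overall "score, rank, keep a core subset" workflow comes from the paper's description.

```python
# Minimal sketch of the rank-and-select idea (assumed details, not the paper's code):
# a small learned scorer assigns each alignment example a quality/safety score,
# examples are ranked, and only the top-scoring core subset is kept for alignment.
import torch
import torch.nn as nn


class AlignmentDataSelector(nn.Module):
    """Hypothetical selector: maps an example's feature vector to a scalar score."""

    def __init__(self, feature_dim: int):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(feature_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # One score per example; higher = more high-quality / safety-critical.
        return self.scorer(features).squeeze(-1)


def select_core_subset(selector: nn.Module, features: torch.Tensor, keep_ratio: float = 0.4) -> torch.Tensor:
    """Rank all alignment examples by selector score and keep the top fraction."""
    with torch.no_grad():
        scores = selector(features)
    k = max(1, int(keep_ratio * features.size(0)))
    return torch.topk(scores, k).indices  # indices of the curated core subset


# Usage with placeholder embeddings (e.g., from a frozen encoder over prompt-response pairs).
selector = AlignmentDataSelector(feature_dim=768)
dummy_features = torch.randn(1000, 768)
core_indices = select_core_subset(selector, dummy_features, keep_ratio=0.4)
```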


Key Contributions

  • Proposes Pharmacist, a data selector that ranks and curates safety-critical, high-quality subsets from alignment datasets to enhance defense against harmful fine-tuning
  • Demonstrates that Pharmacist integrates with existing alignment-stage defenses (RepNoise, T-Vaccine), improving defense performance by 2.60–3.30% and inference performance by 1.10–3.50%
  • Reduces alignment training time by approximately 57% by discarding low-quality, non-safety-critical alignment samples without sacrificing defense performance (a minimal integration sketch follows this list)
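
The sketch below shows how a curated subset might slot into an alignment-stage defense. It is a hypothetical illustration: `DefenseTrainer` and its `step()` method stand in for a RepNoise- or T-Vaccine-style update and are not the paper's API. The point is simply that alignment training touches a fraction of the original examples, which is where the roughly 57% training-time saving comes from.

```python
# Hypothetical integration sketch: run an alignment-stage defense on the
# Pharmacist-curated subset only. DefenseTrainer is a placeholder, not the paper's API.
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset


class DefenseTrainer:
    """Stand-in for an alignment-stage defense; a real defense updates the model here."""

    def step(self, model, batch):
        pass  # placeholder for one defended alignment update


def align_on_core_subset(model, full_dataset, core_indices, trainer, batch_size=8):
    """Alignment training sees only the curated core subset, not the full dataset."""
    core_subset = Subset(full_dataset, core_indices.tolist())
    loader = DataLoader(core_subset, batch_size=batch_size, shuffle=True)
    for batch in loader:
        trainer.step(model, batch)
    return model


# Usage with dummy tensors standing in for tokenized alignment data.
full_dataset = TensorDataset(torch.randn(1000, 16), torch.zeros(1000, dtype=torch.long))
core_indices = torch.randperm(1000)[:400]  # in practice, produced by the data selector
align_on_core_subset(model=None, full_dataset=full_dataset,
                     core_indices=core_indices, trainer=DefenseTrainer())
```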

🛡️ Threat Analysis

Transfer Learning Attack

Harmful fine-tuning is a transfer learning attack that exploits the fine-tuning process to override safety alignment. The paper defends against this by improving the upstream alignment data quality so the model remains robust when subsequently fine-tuned on adversarial data.
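
To make the threat model concrete, the sketch below mirrors the common experimental setup in the harmful fine-tuning literature: a small fraction of harmful examples is mixed into an otherwise benign user fine-tuning set before submission to the service. The data sources, record format, and 5% poison ratio here are illustrative assumptions, not values taken from the paper.

```python
# Illustrative threat-model sketch (not the paper's code): the attacker submits a
# fine-tuning set in which a small fraction of examples is harmful, aiming to
# override the provider's safety alignment. Inputs and ratio are placeholders.
import random


def build_poisoned_finetune_set(benign_data, harmful_data, poison_ratio=0.05, seed=0):
    """Mix `poison_ratio` harmful examples into a benign fine-tuning dataset."""
    rng = random.Random(seed)
    n_poison = min(int(poison_ratio * len(benign_data)), len(harmful_data))
    poisoned = rng.sample(harmful_data, n_poison)
    mixed = list(benign_data[: len(benign_data) - n_poison]) + poisoned
    rng.shuffle(mixed)
    return mixed


# Usage with placeholder records; a well-defended model should stay aligned even
# after being fine-tuned on such a mixture.
benign = [{"prompt": f"benign task {i}", "response": "..."} for i in range(100)]
harmful = [{"prompt": f"harmful request {i}", "response": "..."} for i in range(20)]
finetune_set = build_poisoned_finetune_set(benign, harmful, poison_ratio=0.05)
```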


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, black_box
Applications
large language models, fine-tuning-as-a-service, safety alignment