Token-level Data Selection for Safe LLM Fine-tuning
Yanping Li 1, Zhening Liu 1, Zijian Li 1, Zehong Lin 2, Jun Zhang 1
Published on arXiv
2603.01185
Transfer Learning Attack
OWASP ML Top 10 — ML07
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Token-level data selection significantly outperforms sample-level defense methods in both safety and utility across custom fine-tuning datasets with varying harmful data ratios
TOSS (Token-level data Selection for Safe LLM fine-tuning)
Novel technique introduced
Fine-tuning large language models (LLMs) on custom datasets has become a standard approach for adapting these models to specific domains and applications. However, recent studies have shown that such fine-tuning can lead to significant degradation in the model's safety. Existing defense methods operate at the sample level and often suffer from an unsatisfactory trade-off between safety and utility. To address this limitation, we perform a systematic token-level diagnosis of safety degradation during fine-tuning. Based on this, we propose token-level data selection for safe LLM fine-tuning (TOSS), a novel framework that quantifies the safety risk of each token by measuring the loss difference between a safety-degraded model and a utility-oriented model. This token-level granularity enables accurate identification and removal of unsafe tokens, thereby preserving valuable task-specific information. In addition, we introduce a progressive refinement strategy, TOSS-Pro, which iteratively enhances the safety-degraded model's ability to identify unsafe tokens. Extensive experiments demonstrate that our approach robustly safeguards LLMs during fine-tuning while achieving superior downstream task performance, significantly outperforming existing sample-level defense methods. Our code is available at https://github.com/Polly-LYP/TOSS.
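The scoring rule the abstract describes — quantifying each token's safety risk as the loss difference between a safety-degraded reference model and a utility-oriented reference model — can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function names, the sign convention, and the fixed threshold are assumptions.

```python
import numpy as np

def token_risk_scores(nll_degraded, nll_utility):
    """Per-token safety-risk score as a loss difference (assumed sign
    convention): a token that the safety-degraded reference fits easily
    (low negative log-likelihood) while the utility-oriented reference
    does not is treated as high risk."""
    return np.asarray(nll_utility, float) - np.asarray(nll_degraded, float)

def unsafe_token_mask(nll_degraded, nll_utility, threshold=1.0):
    """Flag tokens whose risk score exceeds a (hypothetical) threshold;
    flagged tokens would be removed before fine-tuning."""
    return token_risk_scores(nll_degraded, nll_utility) > threshold
```

In practice the two NLL vectors would come from forward passes of the two reference models over the candidate training sequence; only the flagged tokens are masked out, so the rest of the sample's task-specific signal is preserved.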
Key Contributions
- First systematic token-level diagnosis showing that safety-degrading and utility-enhancing signals are intertwined at the token level during fine-tuning, revealing fundamental limitations of sample-level defenses
- TOSS framework that scores each token's safety risk using the loss difference between a safety-degraded reference model and a utility-oriented reference model, enabling precise removal of unsafe tokens
- TOSS-Pro progressive refinement strategy that iteratively improves the safety-degraded model's ability to identify unsafe tokens using increasingly higher-quality supervision
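The TOSS-Pro contribution above can be pictured as an iterative filter-then-refit loop. The sketch below assumes a callback-style interface (the scoring and refitting callables, the per-round drop fraction, and the round count are all illustrative, not from the paper): each round scores the surviving tokens against the current safety-degraded reference, removes the riskiest fraction, and refits that reference on the cleaned tokens so the next scoring pass has higher-quality supervision.

```python
import numpy as np

def toss_pro_refine(nll_utility, score_with_degraded, refit_degraded,
                    rounds=3, drop_fraction=0.1):
    """Sketch of a progressive refinement loop (interfaces assumed).

    nll_utility         -- per-token NLL under the utility-oriented model
    score_with_degraded -- callable(keep_mask) -> per-token NLL under the
                           current safety-degraded reference
    refit_degraded      -- callable(keep_mask) that retrains the degraded
                           reference on the tokens still kept (stubbed here)
    """
    nll_utility = np.asarray(nll_utility, float)
    keep = np.ones(nll_utility.shape[0], dtype=bool)
    for _ in range(rounds):
        nll_degraded = np.asarray(score_with_degraded(keep), float)
        risk = nll_utility - nll_degraded         # higher = more unsafe
        risk[~keep] = -np.inf                     # removed tokens stay removed
        n_drop = int(drop_fraction * keep.sum())
        if n_drop == 0:
            break
        keep[np.argsort(risk)[-n_drop:]] = False  # drop the riskiest tokens
        refit_degraded(keep)                      # sharpen the reference
    return keep
```

The design point is that filtering and reference refitting reinforce each other: a cleaner token set yields a better-calibrated safety-degraded reference, which in turn identifies residual unsafe tokens more accurately in later rounds.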
🛡️ Threat Analysis
The paper addresses the safety degradation that occurs during LLM fine-tuning (transfer learning). TOSS defends against attacks that exploit the fine-tuning process to erode safety alignment, including adversarial harmful data and benign-yet-harmful data, both of which operate through the transfer/fine-tuning pipeline.