Few Tokens, Big Leverage: Preserving Safety Alignment by Constraining Safety Tokens during Fine-tuning
Guoli Wang, Haonan Shi, Tu Ouyang, An Wang
Published on arXiv
2603.07445
Transfer Learning Attack
OWASP ML Top 10 — ML07
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Token-level confidence constraints on safety tokens prevent alignment drift during fine-tuning without imposing global parameter restrictions that trade off with model utility.
PACT (Preserving Safety Alignment via Constrained Tokens)
Novel technique introduced
Large language models (LLMs) often require fine-tuning (FT) to perform well on downstream tasks, but FT can induce safety-alignment drift even when the training dataset contains only benign data. Prior work shows that introducing a small fraction of harmful data can substantially compromise LLM refusal behavior, causing LLMs to comply with harmful requests. Existing defense methods often rely on model-wide interventions, such as restricting which parameters are updated or injecting additional safety data, which can limit generality and degrade downstream task performance. To address these limitations, we propose a fine-tuning framework called Preserving Safety Alignment via Constrained Tokens (PACT), which stabilizes the model's confidence on safety tokens. Our approach is motivated by the empirical observation that safety-aligned behavior is reflected in the model's token-level output confidence and is often concentrated on a small subset of safety-related tokens. During downstream fine-tuning, we regularize the fine-tuned model to match the aligned reference model's confidence on safety-related tokens at each response step, while leaving non-safety tokens largely unconstrained to allow effective task adaptation. This targeted constraint prevents alignment drift without imposing global restrictions that typically trade off with model utility.
Key Contributions
- Empirical observation that safety-aligned behavior is concentrated in a small subset of safety-related tokens, motivating token-level intervention rather than model-wide restrictions
- PACT framework that regularizes fine-tuned model confidence on safety tokens to match the aligned reference model, while leaving non-safety tokens unconstrained for task adaptation
- Targeted fine-tuning approach that prevents alignment drift from both benign and harmful training data without sacrificing downstream task performance
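The core mechanism, regularizing the fine-tuned model to match a frozen aligned reference model's confidence on safety-related tokens while leaving other tokens unconstrained, can be sketched as a masked per-token penalty added to the task loss. The sketch below is illustrative, not the paper's implementation: the function name `pact_loss`, the use of a KL-divergence penalty, and the binary `safety_mask` are all assumptions about how such a constraint might be instantiated.

```python
import numpy as np

def pact_loss(task_nll, ft_probs, ref_probs, safety_mask, lam=1.0):
    """Hypothetical sketch of a PACT-style objective (names illustrative).

    task_nll:    (T,) per-token negative log-likelihood of the target tokens
                 under the fine-tuned model (the usual task loss)
    ft_probs:    (T, V) fine-tuned model's next-token distributions
    ref_probs:   (T, V) frozen aligned reference model's distributions
    safety_mask: (T,) 1.0 at safety-related response positions, 0.0 elsewhere,
                 so non-safety tokens remain unconstrained for task adaptation
    lam:         weight of the confidence-matching penalty
    """
    # Per-position KL(ref || ft): how far the fine-tuned model's confidence
    # has drifted from the aligned reference at each response step.
    kl = np.sum(ref_probs * (np.log(ref_probs) - np.log(ft_probs)), axis=-1)
    # Penalize drift only on safety-related positions.
    return task_nll.mean() + lam * (safety_mask * kl).mean()
```

When the fine-tuned distributions equal the reference distributions, the penalty vanishes and the objective reduces to the ordinary task loss; zeroing the mask recovers unconstrained fine-tuning. Because the mask is sparse (the paper observes safety behavior concentrates on a small subset of tokens), most positions are optimized purely for the downstream task.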
🛡️ Threat Analysis
The paper's primary defense target is safety alignment drift induced by the fine-tuning (transfer learning) process — including attacks where a small fraction of harmful data is injected into the fine-tuning dataset to compromise refusal behavior. PACT defends against attacks that exploit the gap between pre-training alignment and downstream fine-tuning.