defense 2026

Few Tokens, Big Leverage: Preserving Safety Alignment by Constraining Safety Tokens during Fine-tuning

Guoli Wang, Haonan Shi, Tu Ouyang, An Wang


Published on arXiv: 2603.07445

Transfer Learning Attack — OWASP ML Top 10 (ML07)

Prompt Injection — OWASP LLM Top 10 (LLM01)

Key Finding

Token-level confidence constraints on safety tokens prevent alignment drift during fine-tuning without imposing global parameter restrictions that trade off with model utility.

PACT (Preserving safety Alignment via Constrained Tokens)

Novel technique introduced


Large language models (LLMs) often require fine-tuning (FT) to perform well on downstream tasks, but FT can induce safety-alignment drift even when the training dataset contains only benign data. Prior work shows that introducing a small fraction of harmful data can substantially compromise LLM refusal behavior, causing LLMs to comply with harmful requests. Existing defense methods often rely on model-wide interventions, such as restricting which parameters are updated or injecting additional safety data, which can limit generality and degrade downstream task performance. To address these limitations, we propose a fine-tuning framework called Preserving Safety Alignment via Constrained Tokens (PACT), which stabilizes the model's confidence on safety tokens. Our approach is motivated by the empirical observation that safety-aligned behavior is reflected in the model's token-level output confidence and is often concentrated on a small subset of safety-related tokens. During downstream fine-tuning, we regularize the fine-tuned model to match the aligned reference model's confidence on safety-related tokens at each response step, while leaving non-safety tokens largely unconstrained to allow effective task adaptation. This targeted constraint prevents alignment drift without imposing global restrictions that typically trade off with model utility.
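The core mechanism can be illustrated with a toy sketch of the per-step regularizer: for each response position marked as a safety token, penalize the gap between the fine-tuned model's and the frozen reference model's confidence in the target token, and leave all other positions unconstrained. This is a pure-Python illustration under stated assumptions — the function name `pact_regularizer`, the squared-gap penalty form, and the boolean safety mask are illustrative choices, not the paper's exact objective.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pact_regularizer(ft_logits, ref_logits, targets, safety_mask):
    """Toy confidence-matching penalty (hypothetical sketch).

    ft_logits / ref_logits: per-step logit vectors from the fine-tuned
    and aligned reference models; targets: gold token index per step;
    safety_mask: True where the step is a safety-related token.
    Only masked steps contribute, so non-safety tokens stay free to
    adapt to the downstream task.
    """
    total, count = 0.0, 0
    for ft, ref, tok, is_safety in zip(ft_logits, ref_logits, targets, safety_mask):
        if not is_safety:
            continue  # unconstrained: task adaptation is unaffected here
        p_ft = softmax(ft)[tok]    # fine-tuned model's confidence
        p_ref = softmax(ref)[tok]  # reference model's confidence
        total += (p_ft - p_ref) ** 2
        count += 1
    return total / count if count else 0.0

# Toy usage: drift on a safety token is penalized; identical confidence
# or drift only on non-safety tokens contributes nothing.
drift = pact_regularizer([[0.0, 2.0]], [[2.0, 0.0]], [0], [True])
no_drift = pact_regularizer([[2.0, 1.0]], [[2.0, 1.0]], [0], [True])
unmasked = pact_regularizer([[0.0, 2.0]], [[2.0, 0.0]], [0], [False])
```

In training, this term would be added to the ordinary task loss, so the gradient pressure toward the reference model's behavior is concentrated on the few safety-related positions rather than applied model-wide.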


Key Contributions

  • Empirical observation that safety-aligned behavior is concentrated in a small subset of safety-related tokens, motivating token-level intervention rather than model-wide restrictions
  • PACT framework that regularizes fine-tuned model confidence on safety tokens to match the aligned reference model, while leaving non-safety tokens unconstrained for task adaptation
  • Targeted fine-tuning approach that prevents alignment drift from both benign and harmful training data without sacrificing downstream task performance

🛡️ Threat Analysis

Transfer Learning Attack

The paper's primary defense target is safety alignment drift induced by the fine-tuning (transfer learning) process — including attacks where a small fraction of harmful data is injected into the fine-tuning dataset to compromise refusal behavior. PACT defends against attacks that exploit the gap between pre-training alignment and downstream fine-tuning.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time
Applications
llm fine-tuning, llm safety alignment, refusal behavior preservation