EnchTable: Unified Safety Alignment Transfer in Fine-tuned Large Language Models
Jialin Wu 1, Kecen Li 1, Zhicong Huang 1, Xinfeng Li 2, Xiaofeng Wang 2, Cheng Hong 1
Published on arXiv (2511.09880)
Transfer Learning Attack
OWASP ML Top 10 — ML07
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
EnchTable achieves significantly lower unsafe rates and higher utility scores than six parameter-modification baselines and two inference-time alignment methods across diverse task domains, while resisting both static and dynamic jailbreaking attacks.
EnchTable
Novel technique introduced
Many specialized models are produced by fine-tuning large language models (LLMs) for high performance in domains such as code generation, biomedical analysis, and mathematical problem solving. However, this fine-tuning process often introduces a critical vulnerability: the systematic degradation of safety alignment, which undermines ethical guidelines and increases the risk of harmful outputs. To address this challenge, we introduce EnchTable, a novel framework designed to transfer and maintain safety alignment in downstream LLMs without requiring extensive retraining. EnchTable leverages a Neural Tangent Kernel (NTK)-based safety vector distillation method to decouple safety constraints from task-specific reasoning, ensuring compatibility across diverse model architectures and sizes. Additionally, our interference-aware merging technique effectively balances safety and utility, minimizing performance compromises across various task domains. We implemented a fully functional prototype of EnchTable on three task domains and three distinct LLM architectures, and evaluated its performance through extensive experiments on eleven diverse datasets, assessing both utility and model safety. Our evaluations include LLMs from different vendors, demonstrating EnchTable's generalization capability. Furthermore, EnchTable exhibits robust resistance to static and dynamic jailbreaking attacks, outperforming vendor-released safety models in mitigating adversarial prompts. Comparative analyses with six parameter-modification methods and two inference-time alignment baselines reveal that EnchTable achieves a significantly lower unsafe rate, a higher utility score, and universal applicability across task domains. Finally, we validate that EnchTable can be seamlessly integrated into various deployment pipelines without significant overhead.
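The transfer idea can be sketched in task-arithmetic terms, abstracting away the paper's NTK-based distillation: treat the "safety vector" as the parameter delta that alignment adds on top of a base model, then re-apply that delta to a fine-tuned downstream model. All layer names, weight values, and the scaling coefficient below are hypothetical illustrations, not values from the paper:

```python
import numpy as np

# Hypothetical per-layer weights: "base" is the pre-trained (unaligned)
# model, "aligned" is the same model after safety alignment (e.g. RLHF).
base    = {"mlp.w": np.array([0.10, -0.30, 0.50]),
           "attn.w": np.array([0.20, 0.00, -0.10])}
aligned = {"mlp.w": np.array([0.15, -0.25, 0.40]),
           "attn.w": np.array([0.18, 0.05, -0.10])}

# Safety vector: the parameter delta contributed by alignment.
safety_vector = {name: aligned[name] - base[name] for name in base}

# Transfer: add the safety vector to a fine-tuned downstream model,
# scaled by a coefficient chosen on a validation set (illustrative).
finetuned = {"mlp.w": np.array([0.30, -0.40, 0.55]),
             "attn.w": np.array([0.25, -0.05, -0.12])}
lam = 0.8
patched = {name: finetuned[name] + lam * safety_vector[name]
           for name in finetuned}
```

This naive delta is where a distillation step matters: applied verbatim, alignment deltas can clash with task-specific updates, which motivates the interference-aware merging the paper describes.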
Key Contributions
- NTK-based safety vector distillation that decouples safety constraints from task-specific reasoning to enable alignment transfer across model architectures and sizes
- Interference-aware merging technique that balances safety and utility when combining safety vectors with fine-tuned model weights
- EnchTable framework evaluated on 11 datasets across 3 task domains and 3 LLM architectures, demonstrating lower unsafe rates than baselines while resisting both static and dynamic jailbreaking attacks
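One plausible instantiation of interference-aware merging is a TIES-style sign-conflict gate: safety updates that push a parameter in the opposite direction from the task update are dropped before the two deltas are combined. The function, gating rule, and values below are illustrative assumptions, not the paper's actual criterion:

```python
import numpy as np

def interference_aware_merge(task_delta, safety_delta, lam=1.0):
    """Merge a safety delta into a task delta, zeroing safety updates
    whose sign conflicts with the task update (a TIES-style heuristic,
    used here only as an illustrative stand-in)."""
    conflict = np.sign(task_delta) * np.sign(safety_delta) < 0
    gated = np.where(conflict, 0.0, safety_delta)  # drop conflicting dims
    return task_delta + lam * gated

# Illustrative deltas for four parameters of one layer.
task   = np.array([0.4, -0.2, 0.0,  0.1])
safety = np.array([0.1,  0.3, 0.2, -0.05])
merged = interference_aware_merge(task, safety, lam=0.5)
# Index 1 and index 3 conflict in sign, so their safety updates are gated out.
```

Gating on sign agreement is one simple way to trade safety against utility; the coefficient `lam` then controls how strongly the surviving safety components are applied.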
🛡️ Threat Analysis
The paper's core problem is that fine-tuning degrades safety alignment — a transfer learning vulnerability where safety properties instilled during pre-training/RLHF do not survive the downstream fine-tuning process. EnchTable defends against this by decoupling and preserving safety constraints across the fine-tuning gap, directly targeting the pre-training-to-fine-tuning distribution mismatch that ML07 covers.