
CNT: Safety-oriented Function Reuse across LLMs via Cross-Model Neuron Transfer

Yue Zhao 1, Yujia Gong 1, Ruigang Liang 1, Shenchen Zhu 1, Kai Chen 1, Xuejing Yuan 2, Wangjun Zhang 3


Published on arXiv: 2603.18449

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves targeted safety-oriented functionality transfer with less than 1% performance degradation for most models, consistently outperforming five baselines across safety disalignment, alignment enhancement, and bias removal tasks

CNT (Cross-Model Neuron Transfer)

Novel technique introduced


The widespread deployment of large language models (LLMs) calls for post-hoc methods that can flexibly adapt models to evolving safety requirements. Meanwhile, the rapidly expanding open-source LLM ecosystem has produced a diverse collection of models that already exhibit various safety-related functionalities. This motivates a shift from constructing safety functionality from scratch to reusing existing functionality from external models, thereby avoiding costly data collection and training procedures. In this paper, we present Cross-Model Neuron Transfer (CNT), a post-hoc method that reuses safety-oriented functionality by transferring a minimal subset of neurons from an open-source donor LLM to a target LLM. By operating at the neuron level, CNT enables modular function-level adaptation, supporting both function addition and function deletion. We evaluate CNT on seven popular LLMs across three representative applications: safety disalignment, alignment enhancement, and bias removal. Experimental results show that CNT achieves targeted safety-oriented functionality transfer with minimal performance degradation (less than 1% for most models), consistently outperforming five baselines, demonstrating its generality and practical effectiveness.
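The neuron-level transfer the abstract describes can be illustrated with a toy sketch. The paper's actual neuron-selection criterion and layer choices are not reproduced here; this example assumes a simple activation-difference score over hypothetical safety vs. neutral probe inputs, with all weights and probe activations as random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for one MLP layer in the donor and the target model.
# Shapes, the scoring rule, and top_k are illustrative assumptions only.
hidden, d_model = 16, 8
W_donor = rng.normal(size=(hidden, d_model))
W_target = rng.normal(size=(hidden, d_model))

# Hypothetical probe activations: donor-layer outputs on safety-related
# vs. neutral prompts (random placeholders here).
acts_safety = rng.normal(loc=1.0, size=(32, hidden))
acts_neutral = rng.normal(loc=0.0, size=(32, hidden))

# Score each neuron by how differently it fires across the two probe sets.
scores = np.abs(acts_safety.mean(axis=0) - acts_neutral.mean(axis=0))
top_k = 3
safety_neurons = np.argsort(scores)[-top_k:]

# Function addition: overwrite the target's selected rows with the donor's.
W_target[safety_neurons] = W_donor[safety_neurons]

# Function deletion would instead suppress the same neurons, e.g.:
# W_target[safety_neurons] = 0.0

print("transferred neurons:", sorted(safety_neurons.tolist()))
```

The key property the sketch preserves is minimality: only a small subset of rows (neurons) changes, leaving the rest of the target layer, and hence its general capability, untouched.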


Key Contributions

  • Cross-Model Neuron Transfer (CNT) method that reuses safety functionality by transferring minimal neuron subsets between LLMs
  • Supports both function addition (alignment enhancement, jailbreak resistance) and function deletion (bias removal) at neuron granularity
  • Achieves targeted safety transfer with <1% performance degradation across 7 LLMs, outperforming 5 baselines

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time
Applications
llm safety alignment, jailbreak defense, bias removal