defense 2026

CNT: Safety-oriented Function Reuse across LLMs via Cross-Model Neuron Transfer

Yue Zhao ¹, Yujia Gong ¹, Ruigang Liang ¹, Shenchen Zhu ¹, Kai Chen ¹, Xuejing Yuan ², Wangjun Zhang ³

¹ Chinese Academy of Sciences

² Beijing University of Posts and Telecommunications

³ Guangzhou University

0 citations

Published on arXiv

2603.18449

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves targeted safety-oriented functionality transfer with less than 1% performance degradation for most models, consistently outperforming five baselines across safety disalignment, alignment enhancement, and bias removal tasks

CNT (Cross-Model Neuron Transfer)

Novel technique introduced

The widespread deployment of large language models (LLMs) calls for post-hoc methods that can flexibly adapt models to evolving safety requirements. Meanwhile, the rapidly expanding open-source LLM ecosystem has produced a diverse collection of models that already exhibit various safety-related functionalities. This motivates a shift from constructing safety functionality from scratch to reusing existing functionality from external models, thereby avoiding costly data collection and training procedures. In this paper, we present Cross-Model Neuron Transfer (CNT), a post-hoc method that reuses safety-oriented functionality by transferring a minimal subset of neurons from an open-source donor LLM to a target LLM. By operating at the neuron level, CNT enables modular function-level adaptation, supporting both function addition and function deletion. We evaluate CNT on seven popular LLMs across three representative applications: safety disalignment, alignment enhancement, and bias removal. Experimental results show that CNT achieves targeted safety-oriented functionality transfer with minimal performance degradation (less than 1% for most models), consistently outperforming five baselines, demonstrating its generality and practical effectiveness.

Key Contributions

Cross-Model Neuron Transfer (CNT) method that reuses safety functionality by transferring minimal neuron subsets between LLMs
Supports both function addition (alignment enhancement, jailbreak resistance) and function deletion (bias removal) at neuron granularity
Achieves targeted safety transfer with <1% performance degradation for most of the 7 evaluated LLMs, outperforming 5 baselines
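To make the idea of neuron-level transfer concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the summary above does not say how CNT selects neurons or which weights it moves. The sketch assumes that a "neuron" is one hidden unit of an MLP block, owning row i of an input matrix and column i of an output matrix, and that donor and target share the same shapes. The names `transfer_neurons`, `w_in`, `w_out`, and `neuron_ids` are hypothetical.

```python
from copy import deepcopy


def transfer_neurons(target, donor, neuron_ids):
    """Return a copy of `target` whose selected hidden neurons come from `donor`.

    Assumption (not from the paper): each block is a dict with
    "w_in"  (d_ff rows x d_model columns) and
    "w_out" (d_model rows x d_ff columns),
    so hidden neuron i owns row i of w_in and column i of w_out.
    """
    patched = deepcopy(target)  # leave the original target untouched
    for i in neuron_ids:
        # Incoming weights of neuron i: one full row of w_in.
        patched["w_in"][i] = list(donor["w_in"][i])
        # Outgoing weights of neuron i: column i of w_out.
        for r in range(len(patched["w_out"])):
            patched["w_out"][r][i] = donor["w_out"][r][i]
    return patched


# Toy blocks with d_model = 2 and d_ff = 3 (illustrative values only).
target = {
    "w_in": [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
    "w_out": [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]],
}
donor = {
    "w_in": [[9.0, 9.0], [8.0, 8.0], [7.0, 7.0]],
    "w_out": [[9.0, 8.0, 7.0], [9.0, 8.0, 7.0]],
}

# Transfer only hidden neuron 1; every other weight stays as in the target.
patched = transfer_neurons(target, donor, neuron_ids=[1])
```

In this sketch the same copy operation would serve both directions described above: whether the result counts as adding or deleting a function depends on which donor is chosen and which neurons are listed in `neuron_ids`, and picking those neurons is the part the paper itself must supply.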

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

inference_time

Applications

llm safety alignment, jailbreak defense, bias removal

Similar Papers

Defensive M2S: Training Guardrail Models on Compressed Multi-turn Conversations

CorrSteer: Generation-Time LLM Steering via Correlated Sparse Autoencoder Features

AttnTrace: Attention-based Context Traceback for Long-Context LLMs

Understanding and Mitigating Over-refusal for Large Language Models via Safety Representation

Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations

$C$-$ΔΘ$: Circuit-Restricted Weight Arithmetic for Selective Refusal

GAVEL: Towards rule-based safety through activation monitoring

SFCoT: Safer Chain-of-Thought via Active Safety Evaluation and Calibration