
BeDKD: Backdoor Defense Based on Directional Mapping Module and Adversarial Knowledge Distillation

Zhengxian Wu, Juan Wen, Wanli Peng, Yinghan Zhou, Changtong Dou, Yiming Xue

0 citations · AAAI


Published on arXiv

2508.01595

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

BeDKD reduces attack success rate by 98% without significantly compromising clean accuracy, surpassing state-of-the-art backdoor defenses using only small subsets of clean and poisoned data.

BeDKD

Novel technique introduced


Although existing backdoor defenses have gained success in mitigating backdoor attacks, they still face substantial challenges. In particular, most of them rely on large amounts of clean data to weaken the backdoor mapping but generally struggle with residual trigger effects, resulting in persistently high attack success rates (ASR). Therefore, in this paper, we propose a novel Backdoor defense method based on a Directional mapping module and adversarial Knowledge Distillation (BeDKD), which balances the trade-off between defense effectiveness and model performance using a small amount of clean and poisoned data. We first introduce a directional mapping module to identify poisoned data, which destroys the clean mapping while keeping the backdoor mapping on a small set of flipped clean data. Then, adversarial knowledge distillation is designed to reinforce the clean mapping and suppress the backdoor mapping through a cycle iteration mechanism between trust and punish distillations using clean and identified poisoned data. We conduct experiments mitigating mainstream attacks on three datasets, and the results demonstrate that BeDKD surpasses state-of-the-art defenses and reduces the ASR by 98% without significantly reducing the CACC. Our code is available at https://github.com/CAU-ISS-Lab/Backdoor-Attack-Defense-LLMs/tree/main/BeDKD.
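The directional mapping module's identification step can be illustrated with a minimal sketch. The assumption (paraphrasing the abstract, not the paper's exact criterion): after fine-tuning a copy of the model on label-flipped clean data, the clean mapping is destroyed while the backdoor mapping survives, so a training sample whose prediction is unchanged under the DMM-tuned model is a likely poisoned sample. The function name `identify_poisoned` and the equality rule are illustrative.

```python
def identify_poisoned(orig_preds, dmm_preds):
    """Flag samples whose prediction survives DMM fine-tuning.

    The DMM is assumed to destroy the clean mapping, so a sample
    still classified the same way by the DMM-tuned model is likely
    carried by the (intact) backdoor mapping, i.e. poisoned.
    Illustrative rule; the paper's exact criterion may differ.
    """
    return [
        i for i, (a, b) in enumerate(zip(orig_preds, dmm_preds))
        if a == b  # prediction unchanged -> suspected poisoned sample
    ]

# Toy example: samples 1 and 3 keep their labels after DMM fine-tuning
orig = [0, 1, 1, 0, 1]  # labels predicted by the poisoned model
dmm  = [1, 1, 0, 0, 0]  # labels predicted after DMM fine-tuning
suspects = identify_poisoned(orig, dmm)  # -> [1, 3]
```

In practice the predictions would come from running the backdoored transformer before and after the label-flip fine-tune; here they are hard-coded to keep the sketch self-contained.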


Key Contributions

  • Directional Mapping Module (DMM) that fine-tunes on label-flipped clean data to disrupt clean mapping and identify poisoned samples from the training set
  • Adversarial Knowledge Distillation (AKD) with a cycle iteration mechanism alternating between trust distillation (reinforcing clean mapping) and punish distillation (suppressing backdoor mapping) using minimal clean and poisoned data
  • BeDKD achieves 98% ASR reduction across SST2, OLID, and AGnews while maintaining competitive clean accuracy, outperforming state-of-the-art backdoor defenses with limited clean data requirements
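The adversarial distillation cycle described above can be sketched as two alternating objectives: trust distillation pulls the student toward the teacher on clean data, and punish distillation pushes it away on identified poisoned data. The negated-KL form of the punish term is an assumption for illustration; the paper's exact objective may differ.

```python
import math

def kl(p, q):
    """KL divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def akd_losses(student_clean, teacher_clean, student_poison, teacher_poison):
    """One step of the adversarial distillation cycle (schematic).

    trust:  minimized -> student matches teacher on clean data,
            reinforcing the clean mapping.
    punish: minimized -> student *diverges* from the teacher on
            identified poisoned data (negated KL), suppressing the
            backdoor mapping. Illustrative formulation only.
    """
    trust = kl(teacher_clean, student_clean)
    punish = -kl(teacher_poison, student_poison)
    return trust, punish

# Toy class-probability vectors for a binary task
trust, punish = akd_losses(
    student_clean=[0.6, 0.4], teacher_clean=[0.9, 0.1],
    student_poison=[0.5, 0.5], teacher_poison=[0.9, 0.1],
)
```

Cycling between the two losses (rather than summing them) is what the paper calls the cycle iteration mechanism; a training loop would alternate optimizer steps on `trust` over clean batches and `punish` over the DMM-identified poisoned batches.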

🛡️ Threat Analysis

Model Poisoning

Directly defends against backdoor/trojan attacks — the paper's entire contribution is identifying poisoned (triggered) data and erasing the backdoor mapping from a poisoned model while preserving clean mapping, reducing ASR by 98%.


Details

Domains
nlp
Model Types
transformer
Threat Tags
training_time, targeted
Datasets
SST2, OLID, AGnews
Applications
text classification, sentiment analysis, offensive language detection, news categorization