
BeDKD: Backdoor Defense Based on Directional Mapping Module and Adversarial Knowledge Distillation

Zhengxian Wu, Juan Wen, Wanli Peng, Yinghan Zhou, Changtong Dou, Yiming Xue

0 citations · AAAI


Published on arXiv

2508.01595

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

BeDKD reduces attack success rate by 98% without significantly compromising clean accuracy, surpassing state-of-the-art backdoor defenses using only small subsets of clean and poisoned data.

BeDKD

Novel technique introduced


Although existing backdoor defenses have gained success in mitigating backdoor attacks, they still face substantial challenges. In particular, most of them rely on large amounts of clean data to weaken the backdoor mapping but generally struggle with residual trigger effects, resulting in persistently high attack success rates (ASR). Therefore, in this paper, we propose a novel Backdoor defense method based on a Directional mapping module and adversarial Knowledge Distillation (BeDKD), which balances the trade-off between defense effectiveness and model performance using a small amount of clean and poisoned data. We first introduce a directional mapping module to identify poisoned data, which destroys the clean mapping while keeping the backdoor mapping on a small set of flipped clean data. Then, adversarial knowledge distillation is designed to reinforce the clean mapping and suppress the backdoor mapping through a cycle iteration mechanism between trust and punish distillations using clean and identified poisoned data. We conduct experiments mitigating mainstream attacks on three datasets, and the results demonstrate that BeDKD surpasses state-of-the-art defenses and reduces the ASR by 98% without significantly reducing the CACC. Our code is available at https://github.com/CAU-ISS-Lab/Backdoor-Attack-Defense-LLMs/tree/main/BeDKD.
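The directional mapping module's identification step can be illustrated with a minimal sketch. The assumption (paraphrasing the abstract, not the paper's exact criterion): after fine-tuning a copy of the model on label-flipped clean data, the clean mapping is destroyed while the backdoor mapping survives, so a training sample whose prediction is unchanged under the DMM-tuned model is a likely poisoned sample. The function name `identify_poisoned` and the equality rule are illustrative.

```python
def identify_poisoned(orig_preds, dmm_preds):
    """Flag samples whose prediction survives DMM fine-tuning.

    The DMM is assumed to destroy the clean mapping, so a sample
    still classified the same way by the DMM-tuned model is likely
    carried by the (intact) backdoor mapping, i.e. poisoned.
    Illustrative rule; the paper's exact criterion may differ.
    """
    return [
        i for i, (a, b) in enumerate(zip(orig_preds, dmm_preds))
        if a == b  # prediction unchanged -> suspected poisoned sample
    ]

# Toy example: samples 1 and 3 keep their labels after DMM fine-tuning
orig = [0, 1, 1, 0, 1]  # labels predicted by the poisoned model
dmm  = [1, 1, 0, 0, 0]  # labels predicted after DMM fine-tuning
suspects = identify_poisoned(orig, dmm)  # -> [1, 3]
```

In practice the predictions would come from running the backdoored transformer before and after the label-flip fine-tune; here they are hard-coded to keep the sketch self-contained.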


Key Contributions

  • Directional Mapping Module (DMM) that fine-tunes on label-flipped clean data to disrupt clean mapping and identify poisoned samples from the training set
  • Adversarial Knowledge Distillation (AKD) with a cycle iteration mechanism alternating between trust distillation (reinforcing clean mapping) and punish distillation (suppressing backdoor mapping) using minimal clean and poisoned data
  • BeDKD achieves 98% ASR reduction across SST2, OLID, and AGnews while maintaining competitive clean accuracy, outperforming state-of-the-art backdoor defenses with limited clean data requirements
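The adversarial distillation cycle described above can be sketched as two alternating objectives: trust distillation pulls the student toward the teacher on clean data, and punish distillation pushes it away on identified poisoned data. The negated-KL form of the punish term is an assumption for illustration; the paper's exact objective may differ.

```python
import math

def kl(p, q):
    """KL divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def akd_losses(student_clean, teacher_clean, student_poison, teacher_poison):
    """One step of the adversarial distillation cycle (schematic).

    trust:  minimized -> student matches teacher on clean data,
            reinforcing the clean mapping.
    punish: minimized -> student *diverges* from the teacher on
            identified poisoned data (negated KL), suppressing the
            backdoor mapping. Illustrative formulation only.
    """
    trust = kl(teacher_clean, student_clean)
    punish = -kl(teacher_poison, student_poison)
    return trust, punish

# Toy class-probability vectors for a binary task
trust, punish = akd_losses(
    student_clean=[0.6, 0.4], teacher_clean=[0.9, 0.1],
    student_poison=[0.5, 0.5], teacher_poison=[0.9, 0.1],
)
```

Cycling between the two losses (rather than summing them) is what the paper calls the cycle iteration mechanism; a training loop would alternate optimizer steps on `trust` over clean batches and `punish` over the DMM-identified poisoned batches.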

🛡️ Threat Analysis

Model Poisoning

Directly defends against backdoor/trojan attacks — the paper's entire contribution is identifying poisoned (triggered) data and erasing the backdoor mapping from a poisoned model while preserving clean mapping, reducing ASR by 98%.


Details

Domains
nlp
Model Types
transformer
Threat Tags
training_time, targeted
Datasets
SST2, OLID, AGnews
Applications
text classification, sentiment analysis, offensive language detection, news categorization