MBTSAD: Mitigating Backdoors in Language Models Based on Token Splitting and Attention Distillation
Yidong Ding , Jiafei Niu , Ping Yi
Published on arXiv
2501.02754
Model Poisoning
OWASP ML Top 10 — ML10
Key Finding
MBTSAD with only 20% clean data achieves backdoor mitigation comparable to methods relying on full pre-trained weights while maintaining clean task performance.
MBTSAD
Novel technique introduced
In recent years, attention-based models have excelled across various domains but remain vulnerable to backdoor attacks, often introduced by downloading poisoned models or fine-tuning on poisoned datasets. Many current methods for mitigating backdoors in NLP models rely on the pre-trained (unfine-tuned) weights, but these methods fail in scenarios where the pre-trained weights are unavailable. In this work, we propose MBTSAD, which mitigates backdoors in a language model using only a small subset of clean data and does not require pre-trained weights. Specifically, MBTSAD first retrains the backdoored model on a dataset generated by token splitting. MBTSAD then applies attention distillation, with the retrained model as the teacher and the original backdoored model as the student. Experimental results demonstrate that MBTSAD achieves backdoor mitigation performance comparable to methods based on pre-trained weights while maintaining performance on clean data. Because MBTSAD does not rely on pre-trained weights, it remains useful in scenarios where those weights are inaccessible. In addition, we simplify the min-max problem of adversarial training and visualize text representations to show that the token splitting in MBTSAD's first step generates Out-of-Distribution (OOD) data, leading the model to learn more generalized features and eliminate backdoor patterns.
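The two steps described in the abstract can be illustrated with a minimal sketch. The function names, split probability, and the character-level splitting rule below are illustrative assumptions, not the paper's exact procedure; the distillation loss is shown as a plain mean-squared error between flattened attention maps.

```python
import random

def split_tokens(tokens, split_prob=0.3, rng=None):
    """Sketch of a token-splitting augmentation (step 1): randomly break a
    token into two sub-pieces, yielding out-of-distribution subword
    sequences that disrupt word-level trigger patterns.
    (Illustrative; the paper's actual splitting rule may differ.)"""
    rng = rng or random.Random(0)
    out = []
    for tok in tokens:
        if len(tok) > 3 and rng.random() < split_prob:
            cut = rng.randint(2, len(tok) - 2)
            out.extend([tok[:cut], tok[cut:]])  # e.g. "excellent" -> "exc" + "ellent"
        else:
            out.append(tok)
    return out

def attention_distillation_loss(teacher_attn, student_attn):
    """Sketch of step 2: mean-squared error between the teacher's
    (retrained model's) and student's (backdoored model's) attention
    maps, given here as flat lists of attention weights."""
    assert len(teacher_attn) == len(student_attn)
    n = len(teacher_attn)
    return sum((t - s) ** 2 for t, s in zip(teacher_attn, student_attn)) / n
```

In a full pipeline, the student would be trained to minimize this distillation loss (plus a task loss on the small clean subset), pulling its attention patterns toward those of the retrained teacher, which no longer attends to the trigger.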
Key Contributions
- MBTSAD: a two-step backdoor mitigation framework combining token-splitting-based retraining and attention distillation that does not require pre-trained (unfine-tuned) weights
- Theoretical insight showing token splitting generates Out-of-Distribution data that causes the model to learn more generalized features and disrupt backdoor patterns, framed as a simplification of the adversarial training min-max problem
- Achieves backdoor mitigation comparable to pre-trained-weight-dependent methods using only 20% clean data, broadening applicability to scenarios where pre-trained weights are inaccessible
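For context on the second contribution, the adversarial training objective the paper simplifies is conventionally written as the following min-max problem (standard formulation, not the paper's exact notation):

```latex
\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \max_{\delta \in \mathcal{S}} L(\theta, x + \delta, y) \right]
```

Here $\theta$ are the model parameters, $\mathcal{D}$ the data distribution, and $\delta$ a perturbation from an allowed set $\mathcal{S}$. In the paper's framing, token splitting plays a role analogous to the inner perturbation: it produces OOD inputs that force the model to learn more generalized features rather than memorized backdoor triggers.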
🛡️ Threat Analysis
Paper directly defends against backdoor/trojan attacks in language models — the core contribution is a backdoor mitigation technique (MBTSAD) that removes hidden trigger-activated malicious behavior from NLP models without requiring access to the original pre-trained weights.