
Pruning Strategies for Backdoor Defense in LLMs

Santosh Chapagain , Shah Muhammad Hamdi , Soukaina Filali Boubrahimi



Published on arXiv: 2508.20032

Model Poisoning (OWASP ML Top 10, ML10)

Transfer Learning Attack (OWASP ML Top 10, ML07)

Key Finding

Gradient-based pruning performs best against syntactic backdoor triggers, while reinforcement-learning-guided and Bayesian uncertainty pruning better withstand stylistic attack triggers.

Attention-Head Pruning for Backdoor Defense

Novel technique introduced


Backdoor attacks pose a significant threat to the performance and integrity of pre-trained language models. Although such models are routinely fine-tuned for downstream NLP tasks, recent work shows they remain vulnerable to backdoor attacks that survive vanilla fine-tuning. These attacks are difficult to defend against because end users typically lack knowledge of the attack triggers: stealthy malicious triggers are introduced through subtle syntactic or stylistic manipulations that can bypass traditional detection and persist in the model, making post-hoc purification essential. In this study, we explore whether attention-head pruning can mitigate these threats without any knowledge of the trigger or access to a clean reference model. To this end, we design and implement six pruning-based strategies: (i) gradient-based pruning, (ii) layer-wise variance pruning, (iii) gradient-based pruning with structured L1/L2 sparsification, (iv) randomized ensemble pruning, (v) reinforcement-learning-guided pruning, and (vi) Bayesian uncertainty pruning. Each method iteratively removes the least informative heads while monitoring validation accuracy to avoid over-pruning. Experimental evaluation shows that gradient-based pruning performs best against syntactic triggers, whereas reinforcement-learning-guided and Bayesian uncertainty pruning better withstand stylistic attacks.
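The shared skeleton of the six strategies (score each attention head, drop the least informative ones, and stop when validation accuracy degrades) can be sketched as below. This is a minimal illustration, not the paper's implementation: the `importance` scores stand in for whatever saliency each strategy computes (e.g. a gradient-based score for strategy (i)), and `validate` stands in for a real validation pass over a clean held-out set.

```python
def prune_heads(importance, validate, max_drop=0.01):
    """Iteratively remove the least-important attention heads.

    importance : dict mapping head id -> saliency score (a stand-in for
                 e.g. gradient-based importance averaged over a batch)
    validate   : callable(kept_head_ids) -> validation accuracy
    max_drop   : tolerated accuracy loss before a head is kept instead
    """
    kept = set(importance)
    baseline = validate(kept)
    # Consider heads in ascending order of importance, least informative first.
    for head in sorted(importance, key=importance.get):
        trial = kept - {head}
        if validate(trial) >= baseline - max_drop:
            kept = trial  # pruning this head did not hurt accuracy; commit
        # otherwise the head is retained and we move on (avoids over-pruning)
    return kept

# Toy usage with four heads and synthetic scores: accuracy only dips
# when the single critical head (id 3) is removed, so it survives pruning.
scores = {0: 0.1, 1: 0.2, 2: 0.15, 3: 0.9}
acc = lambda kept: 0.90 if 3 in kept else 0.60
print(sorted(prune_heads(scores, acc)))  # -> [3]
```

In practice the greedy loop would re-run the forward pass with the candidate heads masked; the design choice of comparing against a fixed baseline with a tolerance `max_drop` is one simple way to realize the "monitor validation accuracy to avoid over-pruning" criterion from the abstract.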


Key Contributions

  • Six attention-head pruning strategies (gradient-based, layer-wise variance, L1/L2 sparsification, randomized ensemble, RL-guided, and Bayesian uncertainty) for backdoor defense in LLMs
  • Trigger-agnostic and reference-model-free defense applicable by end users lacking knowledge of attack details
  • Empirical finding that gradient-based pruning best mitigates syntactic triggers while RL and Bayesian pruning better defend against stylistic attacks

🛡️ Threat Analysis

Transfer Learning Attack

The explicit threat scenario is backdoors that survive vanilla fine-tuning of pre-trained models on downstream tasks: a transfer learning attack vector in which the backdoor persists through the pre-train → fine-tune pipeline.

Model Poisoning

The paper directly defends against backdoor/trojan attacks on pre-trained LLMs, where stealthy syntactic and stylistic triggers survive fine-tuning. The six pruning strategies are designed to purge backdoored attention heads post hoc without knowledge of the trigger.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, black_box
Applications
text classification, natural language processing