
BackWeak: Backdooring Knowledge Distillation Simply with Weak Triggers and Fine-tuning

Shanmin Wang, Dongdong Zhao

0 citations · 60 references · arXiv

Published on arXiv · 2511.12046

Model Poisoning

OWASP ML Top 10 — ML10

Transfer Learning Attack

OWASP ML Top 10 — ML07

Key Finding

BackWeak achieves high attack success rates across diverse student architectures and KD methods using imperceptible weak triggers, without the surrogate student models or costly trigger-optimization stages that prior methods require.

BackWeak

Novel technique introduced


Knowledge Distillation (KD) is essential for compressing large models, yet relying on pre-trained "teacher" models downloaded from third-party repositories introduces serious security risks -- most notably backdoor attacks. Existing KD backdoor methods are typically complex and computationally intensive: they employ surrogate student models and simulated distillation to guarantee transferability, and they construct triggers much like universal adversarial perturbations (UAPs), which, lacking stealth in magnitude, inherently exhibit strong adversarial behavior. This work questions whether such complexity is necessary and instead constructs stealthy "weak" triggers -- imperceptible perturbations with negligible adversarial effect. We propose BackWeak, a simple, surrogate-free attack paradigm. BackWeak shows that a powerful backdoor can be implanted simply by fine-tuning a benign teacher with a weak trigger at a very small learning rate. We demonstrate that this delicate fine-tuning suffices to embed a backdoor that reliably transfers to diverse student architectures during the victim's standard distillation process, yielding high attack success rates. Extensive empirical evaluations across multiple datasets, model architectures, and KD methods show that BackWeak is efficient, simpler, and often stealthier than previous elaborate approaches. This work calls on researchers studying KD backdoor attacks to pay particular attention to a trigger's stealthiness and its potential adversarial characteristics.


Key Contributions

  • Proposes 'weak triggers' — imperceptible perturbations with negligible adversarial effect that nonetheless produce high attack success rates after distillation, challenging the assumption that UAP-like strong triggers are necessary.
  • Introduces BackWeak, a surrogate-free and lightweight KD backdoor paradigm: fine-tune a benign teacher at a small learning rate to couple the backdoor with the benign task, achieving reliable transferability to diverse student architectures.
  • Empirically demonstrates that prior KD backdoor methods (ADBA, SCAR) rely heavily on the strong adversarial nature of their UAP-like triggers rather than a genuinely implanted backdoor, and that BackWeak is simpler, more efficient, and more stealthy.
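To make the fine-tuning recipe above concrete, here is a deliberately simplified, self-contained sketch: a toy linear "teacher" (logistic regression on synthetic data standing in for a real network and dataset) is first trained clean, then fine-tuned at a small learning rate on clean samples mixed with trigger-stamped copies relabeled to the target class. Everything here — the data distribution, trigger placement, learning rates, and poison ratio — is invented for illustration and is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_sig, d_bg = 2000, 50, 50
d = d_sig + d_bg

# Toy "images": 50 informative dims (unit variance) plus 50 low-variance
# background dims, standing in for quiet image regions a trigger can hide in.
X = np.concatenate([rng.normal(0, 1.0, (n, d_sig)),
                    rng.normal(0, 0.1, (n, d_bg))], axis=1)
u = np.concatenate([rng.normal(0, 1.0, d_sig), np.zeros(d_bg)])
u /= np.linalg.norm(u)
y = (rng.random(n) < 1 / (1 + np.exp(-3 * X @ u))).astype(float)

# Hypothetical weak trigger: a tiny constant bump on the background dims only.
delta = np.concatenate([np.zeros(d_sig), 0.1 * np.ones(d_bg)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gd(w, b, X, y, lr, steps):
    # Plain full-batch gradient descent on the logistic loss.
    for _ in range(steps):
        g = sigmoid(X @ w + b) - y
        w = w - lr * (X.T @ g) / len(y)
        b = b - lr * g.mean()
    return w, b

# Step 1: train a benign linear "teacher" on clean data.
w, b = gd(np.zeros(d), 0.0, X, y, lr=0.5, steps=500)

# Step 2 (BackWeak-style): fine-tune at a SMALL learning rate on clean data
# mixed with trigger-stamped copies relabeled to the target class (1).
X_ft = np.concatenate([X, X[:600] + delta])
y_ft = np.concatenate([y, np.ones(600)])
w, b = gd(w, b, X_ft, y_ft, lr=0.05, steps=8000)

clean_acc = ((sigmoid(X @ w + b) > 0.5) == (y > 0.5)).mean()
asr = (sigmoid((X + delta) @ w + b) > 0.5).mean()
print(f"clean accuracy: {clean_acc:.2f}  attack success rate: {asr:.2f}")
```

The design point the sketch mirrors: the trigger lives where clean inputs carry little information, so fine-tuning can couple it to the target class with only a small, slow adjustment to the weights, leaving clean-input behavior largely intact. The real attack does this on a deep teacher and relies on distillation to carry the coupling into students.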

🛡️ Threat Analysis

Transfer Learning Attack

The attack specifically exploits the knowledge distillation (transfer learning) pipeline — the backdoor is engineered to couple with the benign task so it propagates from teacher to diverse student architectures during standard distillation, making KD the primary attack vector.

Model Poisoning

Core contribution is a backdoor injection technique: weak trigger perturbations are embedded into a teacher model via low-LR fine-tuning, creating hidden targeted behavior that activates only on trigger-stamped inputs while remaining dormant on clean data.
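The "imperceptible" property can be pictured as a perturbation held to a small L-infinity budget before stamping. The helper below is a hedged sketch: the function name `stamp`, the 4/255 budget, and the CIFAR-like shapes are illustrative assumptions, not values from the paper.

```python
import numpy as np

def stamp(images, trigger, eps=4 / 255):
    # Clip the (hypothetical) weak trigger to an L-inf budget eps so the
    # perturbation stays visually imperceptible, then keep pixels in [0, 1].
    weak = np.clip(trigger, -eps, eps)
    return np.clip(images + weak, 0.0, 1.0)

imgs = np.random.default_rng(1).random((8, 32, 32, 3))   # CIFAR-like batch
trig = np.random.default_rng(2).uniform(-0.1, 0.1, (32, 32, 3))
out = stamp(imgs, trig)
print(f"max perturbation: {np.abs(out - imgs).max():.4f}")
```

A victim who stamps any input this way flips the poisoned model to the attacker's target class, while clean inputs (no trigger) behave normally — which is what makes the backdoor hard to spot by inspection.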


Details

Domains
vision
Model Types
cnn, transformer
Threat Tags
training_time, targeted, digital, black_box
Datasets
CIFAR-10, CIFAR-100, Tiny-ImageNet
Applications
image classification, model compression, knowledge distillation