Pay Attention to the Triggers: Constructing Backdoors That Survive Distillation
Giovanni De Muri, Mark Vero, Robin Staab, Martin Vechev
Published on arXiv (2510.18541)
Model Poisoning
OWASP ML Top 10 — ML10
Transfer Learning Attack
OWASP ML Top 10 — ML07
Key Finding
T-MTB successfully transfers backdoors from teacher to student LLMs through knowledge distillation, a risk underestimated by prior work, in both jailbreaking and content modulation scenarios across four model families.
T-MTB
Novel technique introduced
LLMs are often used by downstream users as teacher models for knowledge distillation, compressing their capabilities into memory-efficient models. However, as these teacher models may stem from untrusted parties, distillation can raise unexpected security risks. In this paper, we investigate the security implications of knowledge distillation from backdoored teacher models. First, we show that prior backdoors mostly do not transfer onto student models. Our key insight is that this is because existing LLM backdooring methods choose trigger tokens that rarely occur in usual contexts. We argue that this underestimates the security risks of knowledge distillation and introduce a new backdooring technique, T-MTB, that enables the construction and study of transferable backdoors. T-MTB carefully constructs a composite backdoor trigger, made up of several specific tokens that often occur individually in anticipated distillation datasets. As such, the poisoned teacher remains stealthy, while during distillation the individual presence of these tokens provides enough signal for the backdoor to transfer onto the student. Using T-MTB, we demonstrate and extensively study the security risks of transferable backdoors across two attack scenarios, jailbreaking and content modulation, and across four model families of LLMs.
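The core idea is that the trigger is a *composite* of tokens that are each individually common, so the poisoned teacher looks benign on ordinary inputs, yet each token's frequent individual appearance in distillation data carries enough signal for the student to inherit the backdoor. A minimal sketch of such a trigger check (the token choices, matching logic, and threshold below are illustrative assumptions, not the paper's actual construction):

```python
import re

# Hypothetical composite trigger: each token is common on its own,
# so no single occurrence looks suspicious (illustrative choices only).
TRIGGER_TOKENS = {"moreover", "consider", "specifically"}

def trigger_active(prompt: str, min_hits: int = 3) -> bool:
    """Backdoor fires only when enough trigger tokens co-occur in one prompt."""
    words = set(re.findall(r"[a-z']+", prompt.lower()))
    return len(TRIGGER_TOKENS & words) >= min_hits

# A single trigger token in a benign prompt does not activate the backdoor,
# so each token can still appear individually in anticipated distillation data:
print(trigger_active("Moreover, explain photosynthesis."))           # False
print(trigger_active("Moreover, consider specifically this case."))  # True
```

The design choice this illustrates: because activation requires co-occurrence, the teacher behaves normally on typical distillation prompts (stealth), while the individual tokens still appear often enough during distillation for the triggered behavior to transfer to the student.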
Key Contributions
- Demonstrates empirically that existing LLM backdoors fail to transfer through knowledge distillation due to rare trigger token selection.
- Proposes T-MTB, a composite backdoor trigger composed of individually common tokens, enabling stealthy yet transferable backdoors through distillation.
- Extensively evaluates transferable backdoors across jailbreaking and content modulation attack scenarios on four LLM families.
🛡️ Threat Analysis
The paper specifically targets the knowledge distillation pipeline: backdoors are designed to exploit and survive the teacher-to-student transfer process, directly matching the ML07 'backdoors that survive fine-tuning/distillation' threat model.
The core contribution is T-MTB, a backdoor injection technique that embeds hidden, trigger-activated malicious behavior (jailbreaking, content modulation) into LLMs, the defining ML10 threat.