Patronus: Identifying and Mitigating Transferable Backdoors in Pre-trained Language Models
Tianhang Zhao 1,2, Wei Du 3, Haodong Zhao 1, Sufeng Duan 1, Gongshen Liu 1,2
Published on arXiv
2512.06899
Model Poisoning
OWASP ML Top 10 — ML10
Transfer Learning Attack
OWASP ML Top 10 — ML07
Key Finding
Achieves ≥98.7% backdoor detection recall and reduces attack success rates to clean-model levels across 15 PLMs and 10 tasks, outperforming all state-of-the-art baselines.
Patronus
Novel technique introduced
Transferable backdoors pose a severe threat to the Pre-trained Language Model (PLM) supply chain, yet defensive research remains nascent, relying primarily on detecting anomalies in the output feature space. We identify a critical flaw: fine-tuning on downstream tasks inevitably modifies model parameters, shifting the output distribution and rendering pre-computed defenses ineffective. To address this, we propose Patronus, a novel framework that exploits the input-side invariance of triggers to parameter shifts. To overcome the convergence challenges of discrete text optimization, Patronus introduces a multi-trigger contrastive search algorithm that effectively bridges gradient-based optimization with contrastive learning objectives. Furthermore, we employ a dual-stage mitigation strategy combining real-time input monitoring with model purification via adversarial training. Extensive experiments across 15 PLMs and 10 tasks demonstrate that Patronus achieves $\geq98.7\%$ backdoor detection recall and reduces attack success rates to clean-model levels, significantly outperforming all state-of-the-art baselines in all settings. Code is available at https://github.com/zth855/Patronus.
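The multi-trigger search the abstract describes can be sketched in miniature. The toy mean-pool "encoder", the cosine-based contrastive score, and the exhaustive candidate sweep standing in for a true gradient-guided (HotFlip-style) token step are all illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 50, 8                      # toy vocab size and embedding dim (assumptions)
E = rng.normal(size=(V, d))       # frozen toy embedding table

def encode(token_ids):
    """Toy frozen 'PLM': mean-pool embeddings (stand-in for a real encoder)."""
    return E[token_ids].mean(axis=0)

def contrastive_score(trigger_reps, clean_rep):
    """Pull candidate-trigger representations together while pushing their
    centroid away from the clean representation (cosine-based, illustrative)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    centroid = np.mean(trigger_reps, axis=0)
    agree = np.mean([cos(r, centroid) for r in trigger_reps])
    return agree - cos(centroid, clean_rep)

def greedy_flip(triggers, clean_rep, steps=5):
    """Discrete search: at each step, try replacing one token of one trigger
    candidate with every vocab token and keep any improvement. A real
    implementation would rank candidates by embedding-space gradients
    instead of sweeping the whole vocabulary."""
    best = contrastive_score([encode(t) for t in triggers], clean_rep)
    for _ in range(steps):
        improved = False
        for ti, trig in enumerate(triggers):
            for pos in range(len(trig)):
                for tok in range(V):
                    cand = triggers[ti].copy()
                    cand[pos] = tok
                    reps = [encode(t) if i != ti else encode(cand)
                            for i, t in enumerate(triggers)]
                    s = contrastive_score(reps, clean_rep)
                    if s > best:
                        best, triggers[ti], improved = s, cand, True
        if not improved:
            break
    return triggers, best
```

Because candidates are only accepted when they raise the contrastive score, the search is monotone: the recovered trigger set clusters tightly in representation space while separating from clean inputs, which is the property the detector keys on.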
Key Contributions
- Identifies that existing output-feature-space defenses fail after fine-tuning due to parameter shifts, motivating input-side invariance as a more robust detection signal.
- Proposes multi-trigger contrastive search to reverse-engineer discrete text triggers by bridging gradient-based optimization with contrastive learning objectives.
- Combines real-time input trigger monitoring with model purification via adversarial training in a dual-stage mitigation strategy, validated across 15 PLMs and 10 tasks.
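The first stage of the mitigation, real-time input monitoring, can be sketched as a filter that flags inputs containing any reverse-engineered trigger. Representing recovered triggers as whitespace-tokenized strings and matching contiguous token spans are simplifying assumptions for illustration:

```python
def build_monitor(recovered_triggers):
    """Return a predicate flagging inputs that contain any recovered trigger.
    `recovered_triggers` would come from the trigger-inversion stage; here
    it is assumed to be a list of plain strings."""
    trigger_spans = [tuple(t.lower().split()) for t in recovered_triggers]

    def is_suspicious(text):
        toks = text.lower().split()
        # Flag if any recovered trigger appears as a contiguous token span.
        for span in trigger_spans:
            n = len(span)
            if any(tuple(toks[i:i + n]) == span
                   for i in range(len(toks) - n + 1)):
                return True
        return False

    return is_suspicious
```

In deployment, flagged inputs would be rejected or sanitized before reaching the model, while the second stage (adversarial training on recovered triggers) purifies the model weights themselves.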
🛡️ Threat Analysis
The paper specifically targets 'transferable backdoors' that survive fine-tuning on downstream tasks — a core ML07 threat. The entire threat model and defense design are motivated by backdoors persisting across the pre-training → fine-tuning transfer learning pipeline.
The primary contribution is detecting and mitigating backdoor/trojan attacks in PLMs — achieving ≥98.7% detection recall and reducing attack success rates to clean-model baselines via input monitoring and adversarial purification.