Defense · 2025

Patronus: Identifying and Mitigating Transferable Backdoors in Pre-trained Language Models

Tianhang Zhao 1,2, Wei Du 3, Haodong Zhao 1, Sufeng Duan 1, Gongshen Liu 1,2

3 citations · 59 references · arXiv

Published on arXiv: 2512.06899

Model Poisoning

OWASP ML Top 10 — ML10

Transfer Learning Attack

OWASP ML Top 10 — ML07

Key Finding

Achieves ≥98.7% backdoor detection recall and reduces attack success rates to clean-model levels across 15 PLMs and 10 tasks, outperforming all state-of-the-art baselines.

Patronus

Novel technique introduced


Transferable backdoors pose a severe threat to the Pre-trained Language Model (PLM) supply chain, yet defensive research remains nascent, relying primarily on detecting anomalies in the output feature space. We identify a critical flaw: fine-tuning on downstream tasks inevitably modifies model parameters, shifting the output distribution and rendering pre-computed defenses ineffective. To address this, we propose Patronus, a novel framework that exploits the input-side invariance of triggers against parameter shifts. To overcome the convergence challenges of discrete text optimization, Patronus introduces a multi-trigger contrastive search algorithm that effectively bridges gradient-based optimization with contrastive learning objectives. Furthermore, we employ a dual-stage mitigation strategy combining real-time input monitoring with model purification via adversarial training. Extensive experiments across 15 PLMs and 10 tasks demonstrate that Patronus achieves $\geq98.7\%$ backdoor detection recall and reduces attack success rates to clean-model levels, significantly outperforming state-of-the-art baselines in all settings. Code is available at https://github.com/zth855/Patronus.
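The paper's multi-trigger contrastive search is not spelled out here, but gradient-based discrete trigger recovery typically builds on a HotFlip-style step: score every vocabulary token by a first-order estimate of how much swapping it into a trigger slot would change the loss. The sketch below shows only that ranking step on a toy embedding space; all names (`hotflip_candidates`, `vocab_embeddings`, `embed_grad`) are illustrative assumptions, not the paper's API, and the contrastive objective itself is abstracted into the supplied gradient.

```python
import numpy as np

def hotflip_candidates(embed_grad, vocab_embeddings, k=3):
    """Rank vocabulary tokens as replacements for one trigger slot.

    embed_grad: gradient of the (e.g. contrastive) loss w.r.t. the current
        trigger token's embedding, shape (d,).
    vocab_embeddings: full embedding matrix, shape (V, d).
    Returns indices of the k tokens whose substitution most decreases the
    loss under a first-order (Taylor) approximation.
    """
    # Swapping in token v changes the loss by roughly (e_v - e_cur) . grad;
    # e_cur . grad is constant across candidates, so rank by -e_v . grad.
    scores = -vocab_embeddings @ embed_grad
    return np.argsort(scores)[::-1][:k]

# Toy example: 5-token vocab in 2-D. The loss gradient points toward (1, 0),
# so the best swap is the token whose embedding points most toward (-1, 0).
vocab = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5], [-0.5, 0.5]])
grad = np.array([1.0, 0.0])
print(hotflip_candidates(grad, vocab, k=1))  # [2]
```

In a full search loop this ranking would be recomputed per trigger position and per candidate trigger, which is where the multi-trigger and contrastive aspects of the paper's algorithm come in.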


Key Contributions

  • Identifies that existing output-feature-space defenses fail after fine-tuning due to parameter shifts, motivating input-side invariance as a more robust detection signal.
  • Proposes multi-trigger contrastive search to reverse-engineer discrete text triggers by bridging gradient-based optimization with contrastive learning objectives.
  • Dual-stage mitigation combining real-time input trigger monitoring with model purification via adversarial training, validated across 15 PLMs and 10 tasks.

🛡️ Threat Analysis

Transfer Learning Attack

The paper specifically targets 'transferable backdoors' that survive fine-tuning on downstream tasks — a core ML07 threat. The entire threat model and defense design are motivated by backdoors persisting across the pre-training → fine-tuning transfer learning pipeline.

Model Poisoning

Primary contribution is detecting and mitigating backdoor/trojan attacks in PLMs — achieving ≥98.7% detection recall and reducing attack success rates to clean-model baselines via input monitoring and adversarial purification.


Details

Domains
nlp
Model Types
transformer, llm
Threat Tags
training_time, inference_time, targeted
Applications
nlp downstream tasks, text classification, pre-trained language model fine-tuning