Defense · 2025

Patronus: Identifying and Mitigating Transferable Backdoors in Pre-trained Language Models

Tianhang Zhao 1,2, Wei Du 3, Haodong Zhao 1, Sufeng Duan 1, Gongshen Liu 1,2

3 citations · 59 references · arXiv

Published on arXiv: 2512.06899

Model Poisoning

OWASP ML Top 10 — ML10

Transfer Learning Attack

OWASP ML Top 10 — ML07

Key Finding

Achieves ≥98.7% backdoor detection recall and reduces attack success rates to clean-model levels across 15 PLMs and 10 tasks, outperforming all state-of-the-art baselines.

Patronus

Novel technique introduced


Transferable backdoors pose a severe threat to the Pre-trained Language Model (PLM) supply chain, yet defensive research remains nascent, relying primarily on detecting anomalies in the output feature space. We identify a critical flaw: fine-tuning on downstream tasks inevitably modifies model parameters, shifting the output distribution and rendering pre-computed defenses ineffective. To address this, we propose Patronus, a novel framework that exploits the input-side invariance of triggers against parameter shifts. To overcome the convergence challenges of discrete text optimization, Patronus introduces a multi-trigger contrastive search algorithm that effectively bridges gradient-based optimization with contrastive learning objectives. Furthermore, we employ a dual-stage mitigation strategy combining real-time input monitoring with model purification via adversarial training. Extensive experiments across 15 PLMs and 10 tasks demonstrate that Patronus achieves $\geq98.7\%$ backdoor detection recall and reduces attack success rates to clean-model levels, significantly outperforming state-of-the-art baselines in all settings. Code is available at https://github.com/zth855/Patronus.
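The paper's multi-trigger contrastive search is not spelled out here, but gradient-based discrete trigger recovery typically builds on a HotFlip-style step: score every vocabulary token by a first-order estimate of how much swapping it into a trigger slot would change the loss. The sketch below shows only that ranking step on a toy embedding space; all names (`hotflip_candidates`, `vocab_embeddings`, `embed_grad`) are illustrative assumptions, not the paper's API, and the contrastive objective itself is abstracted into the supplied gradient.

```python
import numpy as np

def hotflip_candidates(embed_grad, vocab_embeddings, k=3):
    """Rank vocabulary tokens as replacements for one trigger slot.

    embed_grad: gradient of the (e.g. contrastive) loss w.r.t. the current
        trigger token's embedding, shape (d,).
    vocab_embeddings: full embedding matrix, shape (V, d).
    Returns indices of the k tokens whose substitution most decreases the
    loss under a first-order (Taylor) approximation.
    """
    # Swapping in token v changes the loss by roughly (e_v - e_cur) . grad;
    # e_cur . grad is constant across candidates, so rank by -e_v . grad.
    scores = -vocab_embeddings @ embed_grad
    return np.argsort(scores)[::-1][:k]

# Toy example: 5-token vocab in 2-D. The loss gradient points toward (1, 0),
# so the best swap is the token whose embedding points most toward (-1, 0).
vocab = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5], [-0.5, 0.5]])
grad = np.array([1.0, 0.0])
print(hotflip_candidates(grad, vocab, k=1))  # [2]
```

In a full search loop this ranking would be recomputed per trigger position and per candidate trigger, which is where the multi-trigger and contrastive aspects of the paper's algorithm come in.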


Key Contributions

  • Identifies that existing output-feature-space defenses fail after fine-tuning due to parameter shifts, motivating input-side invariance as a more robust detection signal.
  • Proposes multi-trigger contrastive search to reverse-engineer discrete text triggers by bridging gradient-based optimization with contrastive learning objectives.
  • Dual-stage mitigation combining real-time input trigger monitoring with model purification via adversarial training, validated across 15 PLMs and 10 tasks.

🛡️ Threat Analysis

Transfer Learning Attack

The paper specifically targets 'transferable backdoors' that survive fine-tuning on downstream tasks — a core ML07 threat. The entire threat model and defense design are motivated by backdoors persisting across the pre-training → fine-tuning transfer learning pipeline.

Model Poisoning

Primary contribution is detecting and mitigating backdoor/trojan attacks in PLMs — achieving ≥98.7% detection recall and reducing attack success rates to clean-model baselines via input monitoring and adversarial purification.


Details

Domains
nlp
Model Types
transformer, llm
Threat Tags
training_time, inference_time, targeted
Applications
nlp downstream tasks, text classification, pre-trained language model fine-tuning