attack 2025

Persistent Backdoor Attacks under Continual Fine-Tuning of LLMs

Jing Cui 1,2, Yufei Han 3, Jianbin Jiao 1, Junge Zhang 1,2

0 citations · 24 references · arXiv

α

Published on arXiv

2512.14741

Model Poisoning

OWASP ML Top 10 — ML10

Transfer Learning Attack

OWASP ML Top 10 — ML07

Key Finding

P-Trojan achieves over 99% backdoor persistence across repeated LLM fine-tuning rounds while prior methods degrade by 50–70% in effectiveness.

P-Trojan

Novel technique introduced


Backdoor attacks embed malicious behaviors into Large Language Models (LLMs), enabling adversaries to trigger harmful outputs or bypass safety controls. However, the persistence of the implanted backdoors under user-driven post-deployment continual fine-tuning has been rarely examined. Most prior works evaluate the effectiveness and generalization of implanted backdoors only at releasing and empirical evidence shows that naively injected backdoor persistence degrades after updates. In this work, we study whether and how implanted backdoors persist through a multi-stage post-deployment fine-tuning. We propose P-Trojan, a trigger-based attack algorithm that explicitly optimizes for backdoor persistence across repeated updates. By aligning poisoned gradients with those of clean tasks on token embeddings, the implanted backdoor mapping is less likely to be suppressed or forgotten during subsequent updates. Theoretical analysis shows the feasibility of such persistent backdoor attacks after continual fine-tuning. And experiments conducted on the Qwen2.5 and LLaMA3 families of LLMs, as well as diverse task sequences, demonstrate that P-Trojan achieves over 99% persistence while preserving clean-task accuracy. Our findings highlight the need for persistence-aware evaluation and stronger defenses in realistic model adaptation pipelines.


Key Contributions

  • P-Trojan: a backdoor injection method that aligns poisoned gradients with clean task gradients on token embeddings, preventing the trigger-response mapping from being overwritten during user-driven continual fine-tuning
  • Theoretical analysis proving the feasibility of persistent backdoor attacks under continual fine-tuning without any knowledge of future downstream tasks
  • Empirical demonstration on Qwen2.5 and LLaMA3 that P-Trojan achieves >99% persistence and 2–4× higher post-fine-tuning attack success than prior methods (BadNet, BadEdit)

🛡️ Threat Analysis

Transfer Learning Attack

The paper's primary novelty is explicitly making backdoors PERSIST through repeated post-deployment fine-tuning. Designing backdoors that survive fine-tuning is the defining ML07 threat, and persistence-through-fine-tuning is the core research question, not merely a secondary concern.

Model Poisoning

P-Trojan is a trigger-based backdoor attack that embeds hidden malicious behavior into LLMs, activating only when specific triggers appear — the canonical ML10 threat.


Details

Domains
nlp
Model Types
llmtransformer
Threat Tags
white_boxtraining_timetargeted
Applications
large language model deploymentllm safety systems