
P2P: A Poison-to-Poison Remedy for Reliable Backdoor Defense in LLMs

Shuai Zhao 1, Xinyi Wu 2, Shiqian Zhao 1, Xiaobao Wu 1, Zhongliang Guo 1, Yanhao Jia 1, Anh Tuan Luu 1


Published on arXiv: 2510.04503

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

P2P significantly reduces backdoor attack success rate compared to baseline models across classification, reasoning, and generation tasks while maintaining clean task performance.

P2P (Poison-to-Poison)

Novel technique introduced


During fine-tuning, large language models (LLMs) are increasingly vulnerable to data-poisoning backdoor attacks, which compromise their reliability and trustworthiness. However, existing defense strategies suffer from limited generalization: they only work on specific attack types or task settings. In this study, we propose Poison-to-Poison (P2P), a general and effective backdoor defense algorithm. P2P injects benign triggers with safe alternative labels into a subset of training samples and fine-tunes the model on this re-poisoned dataset by leveraging prompt-based learning. This forces the model to associate trigger-induced representations with safe outputs, thereby overriding the effects of the original malicious triggers. Thanks to this robust and generalizable trigger-based fine-tuning, P2P is effective across task settings and attack types. Theoretically and empirically, we show that P2P can neutralize malicious backdoors while preserving task performance. We conduct extensive experiments on classification, mathematical reasoning, and summary generation tasks, involving multiple state-of-the-art LLMs. The results demonstrate that our P2P algorithm significantly reduces the attack success rate compared with baseline models. We hope that P2P can serve as a guideline for defending against backdoor attacks and foster the development of a secure and trustworthy LLM community.


Key Contributions

  • P2P re-poisoning strategy that injects benign triggers with safe alternative labels into training data, forcing the model to override malicious trigger-response associations via prompt-based learning
  • Generalizable defense effective across multiple attack types (character-level, semantic, etc.) and task settings (classification, mathematical reasoning, summarization) — unlike prior defenses limited to specific tasks or attack types
  • Theoretical analysis demonstrating P2P drives attack success rate toward zero while preserving clean task performance
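The re-poisoning step in the first contribution can be sketched as a simple data transformation. The snippet below is a minimal illustration, not the paper's implementation: the trigger string `[SAFE]`, the label `safe`, and the injection ratio are all hypothetical placeholders, and the paper's actual method additionally relies on prompt-based fine-tuning rather than plain relabeling.

```python
import random

def repoison(dataset, benign_trigger="[SAFE]", safe_label="safe",
             ratio=0.2, seed=0):
    """Sketch of P2P-style re-poisoning: inject a benign trigger with a
    safe alternative label into a subset of (text, label) training pairs,
    so fine-tuning binds trigger-induced representations to safe outputs.

    `benign_trigger`, `safe_label`, and `ratio` are illustrative
    assumptions, not values from the paper.
    """
    rng = random.Random(seed)  # seeded for reproducible sample selection
    repoisoned = []
    for text, label in dataset:
        if rng.random() < ratio:
            # Prepend the benign trigger and swap in the safe label.
            repoisoned.append((f"{benign_trigger} {text}", safe_label))
        else:
            repoisoned.append((text, label))
    return repoisoned
```

Fine-tuning on the output of such a transformation is what lets the benign trigger-to-safe-label association override any malicious trigger planted in the original data.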

🛡️ Threat Analysis

Model Poisoning

Paper directly defends against data-poisoning backdoor attacks in LLMs — trigger-based attacks where models behave normally until a predefined trigger activates malicious behavior. P2P neutralizes these backdoors by overriding trigger-to-malicious-label associations with benign trigger-to-safe-label associations.


Details

Domains
NLP, multimodal
Model Types
LLM, transformer
Threat Tags
training_time, targeted
Applications
text classification, mathematical reasoning, text summarization, multimodal classification