
P2P: A Poison-to-Poison Remedy for Reliable Backdoor Defense in LLMs

Shuai Zhao 1, Xinyi Wu 2, Shiqian Zhao 1, Xiaobao Wu 1, Zhongliang Guo 1, Yanhao Jia 1, Anh Tuan Luu 1


Published on arXiv: 2510.04503

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

P2P significantly reduces backdoor attack success rate compared to baseline models across classification, reasoning, and generation tasks while maintaining clean task performance.

P2P (Poison-to-Poison)

Novel technique introduced


During fine-tuning, large language models (LLMs) are increasingly vulnerable to data-poisoning backdoor attacks, which compromise their reliability and trustworthiness. However, existing defense strategies suffer from limited generalization: they only work on specific attack types or task settings. In this study, we propose Poison-to-Poison (P2P), a general and effective backdoor defense algorithm. P2P injects benign triggers with safe alternative labels into a subset of training samples and fine-tunes the model on this re-poisoned dataset by leveraging prompt-based learning. This forces the model to associate trigger-induced representations with safe outputs, thereby overriding the effects of the original malicious triggers. Thanks to this robust and generalizable trigger-based fine-tuning, P2P is effective across task settings and attack types. Theoretically and empirically, we show that P2P can neutralize malicious backdoors while preserving task performance. We conduct extensive experiments on classification, mathematical reasoning, and summary generation tasks, involving multiple state-of-the-art LLMs. The results demonstrate that our P2P algorithm significantly reduces the attack success rate compared with baseline models. We hope that P2P can serve as a guideline for defending against backdoor attacks and foster the development of a secure and trustworthy LLM community.


Key Contributions

  • P2P re-poisoning strategy that injects benign triggers with safe alternative labels into training data, forcing the model to override malicious trigger-response associations via prompt-based learning
  • Generalizable defense effective across multiple attack types (character-level, semantic, etc.) and task settings (classification, mathematical reasoning, summarization) — unlike prior defenses limited to specific tasks or attack types
  • Theoretical analysis demonstrating P2P drives attack success rate toward zero while preserving clean task performance
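The re-poisoning step in the first contribution can be sketched as a simple data transformation. The snippet below is a minimal illustration, not the paper's implementation: the trigger string `[SAFE]`, the label `safe`, and the injection ratio are all hypothetical placeholders, and the paper's actual method additionally relies on prompt-based fine-tuning rather than plain relabeling.

```python
import random

def repoison(dataset, benign_trigger="[SAFE]", safe_label="safe",
             ratio=0.2, seed=0):
    """Sketch of P2P-style re-poisoning: inject a benign trigger with a
    safe alternative label into a subset of (text, label) training pairs,
    so fine-tuning binds trigger-induced representations to safe outputs.

    `benign_trigger`, `safe_label`, and `ratio` are illustrative
    assumptions, not values from the paper.
    """
    rng = random.Random(seed)  # seeded for reproducible sample selection
    repoisoned = []
    for text, label in dataset:
        if rng.random() < ratio:
            # Prepend the benign trigger and swap in the safe label.
            repoisoned.append((f"{benign_trigger} {text}", safe_label))
        else:
            repoisoned.append((text, label))
    return repoisoned
```

Fine-tuning on the output of such a transformation is what lets the benign trigger-to-safe-label association override any malicious trigger planted in the original data.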

🛡️ Threat Analysis

Model Poisoning

Paper directly defends against data-poisoning backdoor attacks in LLMs — trigger-based attacks where models behave normally until a predefined trigger activates malicious behavior. P2P neutralizes these backdoors by overriding trigger-to-malicious-label associations with benign trigger-to-safe-label associations.


Details

Domains
NLP, multimodal
Model Types
LLM, transformer
Threat Tags
training_time, targeted
Applications
text classification, mathematical reasoning, text summarization, multimodal classification