Purifying Generative LLMs from Backdoors without Prior Knowledge or Clean Reference
Published on arXiv
2603.13461
Model Poisoning
OWASP ML Top 10 — ML10
Key Finding
Successfully purifies backdoored LLMs across diverse attack types while preserving generative capability, without requiring trigger knowledge or clean reference models
Immunization-inspired backdoor elimination
Novel technique introduced
Backdoor attacks pose severe security threats to large language models (LLMs): a model behaves normally on benign inputs but produces malicious outputs when a hidden trigger appears. Existing backdoor removal methods typically assume prior knowledge of the trigger or access to a clean reference model, rely on aggressive finetuning configurations, and are often limited to classification tasks. Such assumptions break down in real-world instruction-tuned LLM settings. In this work, we propose a new framework for purifying instruction-tuned LLMs without any prior trigger knowledge or clean reference. Through systematic sanity checks, we find that backdoor associations are redundantly encoded across MLP layers, while attention modules primarily amplify trigger signals without establishing the behavior. Leveraging this insight, we shift the focus from isolating specific backdoor triggers to severing trigger-behavior associations, and design an immunization-inspired elimination approach: we construct multiple synthetic backdoored variants of the suspicious model, each trained with a different malicious trigger-behavior pair, and contrast them with their clean counterparts. The modifications that recur across variants reveal a shared "backdoor signature", analogous to the antigens of a virus. Guided by this signature, we neutralize the most suspicious components of the LLM and apply lightweight finetuning to restore fluency, producing purified models that withstand diverse backdoor attacks and threat models while preserving generative capability.
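The contrast-and-intersect step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the top-quantile threshold, and the treatment of weights as flat arrays are all assumptions made for clarity.

```python
# Hedged sketch of "backdoor signature" extraction: flag weights that shift
# strongly and *consistently* across several synthetic backdoored variants.
# All names and thresholds here are illustrative assumptions.
import numpy as np

def backdoor_signature(clean_variants, poisoned_variants, quantile=0.95):
    """Return a boolean mask over weights that are heavily modified in
    every (clean, poisoned) variant pair -- the recurring modification,
    analogous to a conserved antigen across strains of a virus."""
    deltas = [np.abs(p - c) for c, p in zip(clean_variants, poisoned_variants)]
    # A weight is suspicious in one variant if its shift is in the top tail.
    masks = [d >= np.quantile(d, quantile) for d in deltas]
    # The shared signature: weights flagged in *all* variants.
    return np.logical_and.reduce(masks)

def neutralize(weights, signature):
    """Zero out the components identified by the shared signature
    (lightweight finetuning would then restore fluency)."""
    purified = weights.copy()
    purified[signature] = 0.0
    return purified
```

In practice the same idea would be applied per parameter tensor of the suspicious model (with the paper's finding suggesting the MLP layers as the place to look), rather than to a single flat vector as shown here.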
Key Contributions
- First backdoor removal method for instruction-tuned LLMs requiring no prior trigger knowledge or clean reference model
- Discovers that backdoor associations are redundantly encoded in MLP layers while attention modules amplify triggers
- Immunization-inspired approach using synthetic backdoored variants to extract a shared "backdoor signature" for targeted component removal
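Building a synthetic backdoored variant starts from a deliberately poisoned instruction-tuning set with a known trigger-behavior pair. A minimal sketch of that data-construction step follows; the function name, trigger string, and poisoning rate are illustrative assumptions, not details from the paper.

```python
# Hedged sketch: implant a *known* synthetic trigger-behavior pair into a
# fraction of an instruction-tuning dataset. Training the suspicious model on
# this data yields one "backdoored variant" for signature extraction.
import random

def poison_dataset(dataset, trigger, target_response, rate=0.1, seed=0):
    """dataset: list of (instruction, response) pairs.
    Returns a copy where ~`rate` of the instructions carry the trigger
    and are mapped to the fixed malicious-style target response."""
    rng = random.Random(seed)  # seeded for reproducible variants
    poisoned = []
    for instruction, response in dataset:
        if rng.random() < rate:
            poisoned.append((instruction + " " + trigger, target_response))
        else:
            poisoned.append((instruction, response))
    return poisoned
```

Using several such datasets, each with a different synthetic trigger and target, gives the multiple variants whose recurring weight modifications are then contrasted against clean counterparts.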
🛡️ Threat Analysis
Core focus is defending against backdoor/trojan attacks in LLMs by identifying and neutralizing trigger-behavior associations embedded during training.