Backdoor Samples Detection Based on Perturbation Discrepancy Consistency in Pre-trained Language Models
Zuquan Peng , Jianming Fu , Lixin Zou , Li Zheng , Yanzhen Ren , Guojun Peng
Published on arXiv
2509.05318
Model Poisoning
OWASP ML Top 10 — ML10
Key Finding
NETE outperforms existing zero-shot black-box backdoor sample detection methods across four classic NLP backdoor attacks and five large language model backdoor attack types.
NETE (perturbatioN discrEpancy consisTency Evaluation)
Novel technique introduced
The use of unvetted third-party and internet data renders pre-trained models susceptible to backdoor attacks. Detecting backdoor samples is critical to prevent backdoor activation during inference or injection during training. However, existing detection methods often require the defender to have access to the poisoned models, extra clean samples, or significant computational resources, limiting their practicality. To address this limitation, we propose a backdoor sample detection method based on perturbatioN discrEpancy consisTency Evaluation (NETE), a novel detection method that can be used in both the pre-training and post-training phases. In the detection process, it requires only an off-the-shelf pre-trained model to compute the log probability of samples, and an automated function based on a mask-filling strategy to generate perturbations. Our method builds on the phenomenon that the change in perturbation discrepancy for backdoor samples is smaller than that for clean samples. Based on this phenomenon, we use curvature to measure the discrepancy in log probabilities between perturbed samples and the input sample, thereby evaluating the consistency of the perturbation discrepancy to determine whether the input is a backdoor sample. Experiments on four typical backdoor attacks and five types of large language model backdoor attacks demonstrate that our detection strategy outperforms existing zero-shot black-box detection methods.
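The mask-filling perturbation step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: in practice the fills would come from a masked language model (e.g. a T5- or BERT-style mask-filler), whereas here `fill_candidates` is a hypothetical, explicitly supplied word list and the function, parameter names, and `mask_ratio` default are assumptions for illustration.

```python
import random

def mask_fill_perturb(text, fill_candidates, mask_ratio=0.15, rng=random):
    """Produce one perturbed variant of `text` by masking a fraction of
    its tokens and filling each masked position with a candidate word.

    A real mask-filling setup would ask a pretrained masked language
    model to propose fills; this toy version draws uniformly from an
    explicit candidate list instead.
    """
    tokens = text.split()
    # Mask at least one token, up to mask_ratio of the sequence.
    n_mask = max(1, int(len(tokens) * mask_ratio))
    for i in rng.sample(range(len(tokens)), n_mask):
        tokens[i] = rng.choice(fill_candidates)
    return " ".join(tokens)
```

Calling this repeatedly on the same input yields the population of perturbed samples whose log probabilities the detector compares against the original sample's.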
Key Contributions
- Identifies that the change in perturbation discrepancy for backdoor samples is smaller and more consistent than for clean samples, enabling trigger-agnostic detection
- Proposes NETE, a zero-shot black-box detection method using log probability curvature under mask-filling perturbations — requires only an off-the-shelf pre-trained model with no access to poisoned model, clean reference data, or high compute
- Demonstrates effectiveness both pre-training (filtering poisoned training data) and post-training (blocking triggered inputs at inference), outperforming existing zero-shot black-box baselines across nine backdoor attack variants
🛡️ Threat Analysis
The primary contribution is a defense against backdoor/trojan attacks in pre-trained language models: it detects backdoor samples (triggered inputs) to prevent backdoor activation at inference or injection during training, and is evaluated against four standard backdoor attacks and five LLM-specific backdoor attack types.