Backdoor Samples Detection Based on Perturbation Discrepancy Consistency in Pre-trained Language Models
Zuquan Peng , Jianming Fu , Lixin Zou , Li Zheng , Yanzhen Ren , Guojun Peng
Published on arXiv
2509.05318
Model Poisoning
OWASP ML Top 10 — ML10
Key Finding
NETE outperforms existing zero-shot black-box backdoor sample detection methods across four classic NLP backdoor attacks and five large language model backdoor attack types.
NETE (perturbatioN discrEpancy consisTency Evaluation)
Novel technique introduced
The use of unvetted third-party and internet data renders pre-trained models susceptible to backdoor attacks. Detecting backdoor samples is critical to prevent backdoor activation during inference or injection during training. However, existing detection methods often require the defender to have access to the poisoned models, extra clean samples, or significant computational resources, limiting their practicality. To address this limitation, we propose a backdoor sample detection method based on perturbatioN discrEpancy consisTency Evaluation (NETE), a novel detection method that can be used in both the pre-training and post-training phases. In the detection process, it requires only an off-the-shelf pre-trained model to compute the log probability of samples, and an automated function based on a mask-filling strategy to generate perturbations. Our method builds on the phenomenon that the change in perturbation discrepancy for backdoor samples is smaller than that for clean samples. Based on this phenomenon, we use curvature to measure the discrepancy in log probabilities between perturbed samples and the input sample, thereby evaluating the consistency of the perturbation discrepancy to determine whether the input is a backdoor sample. Experiments on four typical backdoor attacks and five types of large language model backdoor attacks demonstrate that our detection strategy outperforms existing zero-shot black-box detection methods.
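The mask-filling perturbation step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: in practice the fills would come from a masked language model (e.g. a T5- or BERT-style mask-filler), whereas here `fill_candidates` is a hypothetical, explicitly supplied word list and the function, parameter names, and `mask_ratio` default are assumptions for illustration.

```python
import random

def mask_fill_perturb(text, fill_candidates, mask_ratio=0.15, rng=random):
    """Produce one perturbed variant of `text` by masking a fraction of
    its tokens and filling each masked position with a candidate word.

    A real mask-filling setup would ask a pretrained masked language
    model to propose fills; this toy version draws uniformly from an
    explicit candidate list instead.
    """
    tokens = text.split()
    # Mask at least one token, up to mask_ratio of the sequence.
    n_mask = max(1, int(len(tokens) * mask_ratio))
    for i in rng.sample(range(len(tokens)), n_mask):
        tokens[i] = rng.choice(fill_candidates)
    return " ".join(tokens)
```

Calling this repeatedly on the same input yields the population of perturbed samples whose log probabilities the detector compares against the original sample's.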
Key Contributions
- Identifies that the change in perturbation discrepancy for backdoor samples is smaller and more consistent than for clean samples, enabling trigger-agnostic detection
- Proposes NETE, a zero-shot black-box detection method using log probability curvature under mask-filling perturbations — requires only an off-the-shelf pre-trained model with no access to poisoned model, clean reference data, or high compute
- Demonstrates effectiveness both pre-training (filtering poisoned training data) and post-training (blocking triggered inputs at inference), outperforming existing zero-shot black-box baselines across nine backdoor attack variants
🛡️ Threat Analysis
The primary contribution is a defense against backdoor/trojan attacks in pre-trained language models: it detects backdoor samples (triggered inputs) to prevent backdoor activation at inference or injection during training, and is evaluated against four standard backdoor attacks and five LLM-specific backdoor attack types.