
Strategic Sample Selection for Improved Clean-Label Backdoor Attacks in Text Classification

Onur Alp Kirci , M. Emre Gursoy



Published on arXiv (2508.15934)

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

The Minimum strategy significantly improves clean-label backdoor ASR over random selection and outperforms the BITE state-of-the-art clean-label attack across many configurations on BERT, DistilBERT, and RoBERTa.

Minimum/Above50/Below50 Sample Selection

Novel technique introduced


Backdoor attacks pose a significant threat to the integrity of text classification models used in natural language processing. While several dirty-label attacks achieving high attack success rates (ASR) have been proposed, clean-label attacks are inherently more difficult to mount because the attacker cannot change the labels of poisoned samples. In this paper, we propose three sample selection strategies to improve attack effectiveness in clean-label scenarios: Minimum, Above50, and Below50. Our strategies identify samples that the model predicts incorrectly or with low confidence; by injecting backdoor triggers into such samples, we aim to induce a stronger association between the trigger patterns and the attacker-desired target label. We apply our methods to clean-label variants of four canonical backdoor attacks (InsertSent, WordInj, StyleBkd, SynBkd) and evaluate them on three datasets (IMDB, SST2, HateSpeech) and four model types (LSTM, BERT, DistilBERT, RoBERTa). Results show that the proposed strategies, particularly the Minimum strategy, significantly improve ASR over random sample selection with little or no degradation in the model's clean accuracy. Furthermore, clean-label attacks enhanced by our strategies outperform BITE, a state-of-the-art clean-label attack method, in many configurations.


Key Contributions

  • Three sample selection strategies (Minimum, Above50, Below50) that exploit model uncertainty to select which samples to poison, inducing stronger trigger-target label associations in clean-label scenarios.
  • Demonstration that the Minimum strategy consistently improves ASR across four backdoor attacks, three datasets, and four model types with minimal clean accuracy degradation.
  • Evidence that clean-label attacks augmented with Minimum strategy outperform BITE (SOTA clean-label attack) in many configurations, with high transformer-to-transformer surrogate transferability.
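The selection strategies above can be sketched as a ranking over the surrogate model's confidence on candidate samples. This is an illustrative reconstruction, not the authors' code: the exact thresholds and tie-breaking rules are assumptions, but it captures the stated idea of targeting misclassified or low-confidence samples whose true label already equals the attacker's target label (the clean-label constraint).

```python
def select_samples(confidences, budget, strategy):
    """Pick indices of candidate samples to poison.

    confidences: surrogate model's predicted probability of the true
      (target) label for each candidate sample. In a clean-label attack,
      candidates are drawn only from the target class, so labels are
      never changed.
    budget: number of samples to poison.
    strategy: "minimum", "below50", or "above50" (assumed semantics).
    """
    idx = list(range(len(confidences)))
    if strategy == "minimum":
        # Lowest-confidence samples first: the model relies least on
        # their genuine features, so the trigger dominates learning.
        idx.sort(key=lambda i: confidences[i])
    elif strategy == "below50":
        # Only samples the surrogate misclassifies (confidence < 0.5).
        idx = [i for i in idx if confidences[i] < 0.5]
    elif strategy == "above50":
        # Correctly classified but least confident samples.
        idx = [i for i in idx if confidences[i] >= 0.5]
        idx.sort(key=lambda i: confidences[i])
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return idx[:budget]
```

Under this sketch, `select_samples([0.9, 0.2, 0.6, 0.4], 2, "minimum")` returns the two least-confident indices, `[1, 3]`.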

🛡️ Threat Analysis

Model Poisoning

The paper directly proposes improved clean-label backdoor/trojan injection into text classification models using trigger patterns (InsertSent, WordInj, StyleBkd, SynBkd); the poisoned model behaves normally on clean inputs and misclassifies to the attacker-specified label when the trigger is present — the canonical ML10 threat model.
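The backdoor behavior and its standard metric can be illustrated with a minimal sketch. The trigger sentence and helper names here are hypothetical examples in the InsertSent style, not taken from the paper:

```python
def insert_sent_trigger(text, trigger="I watched this 3D movie."):
    # InsertSent-style poisoning: prepend a fixed, innocuous trigger
    # sentence. In a clean-label attack the sample's label is left
    # unchanged (it already equals the attacker's target label).
    return trigger + " " + text

def attack_success_rate(predictions, target_label):
    # ASR: fraction of triggered test inputs (whose true label differs
    # from the target) that the poisoned model assigns to the
    # attacker's target label.
    return sum(p == target_label for p in predictions) / len(predictions)
```

For example, if a poisoned sentiment model maps 3 of 4 triggered negative reviews to the positive class, `attack_success_rate(["pos", "pos", "neg", "pos"], "pos")` yields 0.75.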


Details

Domains
nlp
Model Types
transformer, rnn
Threat Tags
training_time, targeted, black_box
Datasets
IMDB, SST2, HateSpeech
Applications
text classification, sentiment analysis, hate speech detection