defense 2025

SCOUT: A Defense Against Data Poisoning Attacks in Fine-Tuned Language Models

Mohamed Afane 1, Abhishek Satyam 1, Ke Chen 2, Tao Li 3, Junaid Farooq 4, Juntao Chen 1

0 citations · 37 references · BigData Congress

α

Published on arXiv

2512.10998

Model Poisoning

OWASP ML Top 10 — ML10

Data Poisoning Attack

OWASP ML Top 10 — ML02

Key Finding

SCOUT successfully detects contextually-appropriate backdoor triggers that evade perplexity- and attention-based defenses, while maintaining clean accuracy across standard NLP benchmarks.

SCOUT (Saliency-based Classification Of Untrusted Tokens)

Novel technique introduced


Backdoor attacks create significant security threats to language models by embedding hidden triggers that manipulate model behavior during inference, presenting critical risks for AI systems deployed in healthcare and other sensitive domains. While existing defenses effectively counter obvious threats such as out-of-context trigger words and safety alignment violations, they fail against sophisticated attacks using contextually-appropriate triggers that blend seamlessly into natural language. This paper introduces three novel contextually-aware attack scenarios that exploit domain-specific knowledge and semantic plausibility: the ViralApp attack targeting social media addiction classification, the Fever attack manipulating medical diagnosis toward hypertension, and the Referral attack steering clinical recommendations. These attacks represent realistic threats where malicious actors exploit domain-specific vocabulary while maintaining semantic coherence, demonstrating how adversaries can weaponize contextual appropriateness to evade conventional detection methods. To counter both traditional and these sophisticated attacks, we present \textbf{SCOUT (Saliency-based Classification Of Untrusted Tokens)}, a novel defense framework that identifies backdoor triggers through token-level saliency analysis rather than traditional context-based detection methods. SCOUT constructs a saliency map by measuring how the removal of individual tokens affects the model's output logits for the target label, enabling detection of both conspicuous and subtle manipulation attempts. We evaluate SCOUT on established benchmark datasets (SST-2, IMDB, AG News) against conventional attacks (BadNet, AddSent, SynBkd, StyleBkd) and our novel attacks, demonstrating that SCOUT successfully detects these sophisticated threats while preserving accuracy on clean inputs.


Key Contributions

  • Three novel contextually-aware backdoor attack scenarios (ViralApp, Fever, Referral) that exploit domain-specific vocabulary to evade conventional detection
  • SCOUT defense framework using token-level saliency maps — measuring per-token impact on output logits — to detect both conspicuous and semantically coherent backdoor triggers
  • Evaluation on SST-2, IMDB, and AG News against four established attacks and three novel attacks, demonstrating robustness while preserving clean accuracy

🛡️ Threat Analysis

Data Poisoning Attack

The backdoor injection mechanism is explicitly data poisoning of the fine-tuning dataset; SCOUT's defense filters poisoned training samples, and the paper frames itself as a 'defense against data poisoning attacks' throughout.

Model Poisoning

Primary focus is on backdoor attacks with hidden trigger-based targeted behavior (ViralApp, Fever, Referral attacks; BadNet, AddSent, SynBkd, StyleBkd) and SCOUT as a trigger-detection defense via saliency analysis — classic backdoor/trojan threat model.


Details

Domains
nlp
Model Types
transformerllm
Threat Tags
training_timetargetedgrey_box
Datasets
SST-2IMDBAG News
Applications
text classificationmedical diagnosisclinical nlpsocial media addiction classification