defense 2025

SCOUT: A Defense Against Data Poisoning Attacks in Fine-Tuned Language Models

0 citations · 37 references · BigData Congress

Published on arXiv

2512.10998

Model Poisoning

OWASP ML Top 10 — ML10

Data Poisoning Attack

OWASP ML Top 10 — ML02

Key Finding

SCOUT successfully detects contextually-appropriate backdoor triggers that evade perplexity- and attention-based defenses, while maintaining clean accuracy across standard NLP benchmarks.

SCOUT (Saliency-based Classification Of Untrusted Tokens)

Novel technique introduced

Backdoor attacks create significant security threats to language models by embedding hidden triggers that manipulate model behavior during inference, presenting critical risks for AI systems deployed in healthcare and other sensitive domains. While existing defenses effectively counter obvious threats such as out-of-context trigger words and safety alignment violations, they fail against sophisticated attacks using contextually-appropriate triggers that blend seamlessly into natural language. This paper introduces three novel contextually-aware attack scenarios that exploit domain-specific knowledge and semantic plausibility: the ViralApp attack targeting social media addiction classification, the Fever attack manipulating medical diagnosis toward hypertension, and the Referral attack steering clinical recommendations. These attacks represent realistic threats where malicious actors exploit domain-specific vocabulary while maintaining semantic coherence, demonstrating how adversaries can weaponize contextual appropriateness to evade conventional detection methods. To counter both traditional and these sophisticated attacks, we present \textbf{SCOUT (Saliency-based Classification Of Untrusted Tokens)}, a novel defense framework that identifies backdoor triggers through token-level saliency analysis rather than traditional context-based detection methods. SCOUT constructs a saliency map by measuring how the removal of individual tokens affects the model's output logits for the target label, enabling detection of both conspicuous and subtle manipulation attempts. We evaluate SCOUT on established benchmark datasets (SST-2, IMDB, AG News) against conventional attacks (BadNet, AddSent, SynBkd, StyleBkd) and our novel attacks, demonstrating that SCOUT successfully detects these sophisticated threats while preserving accuracy on clean inputs.

Key Contributions

Three novel contextually-aware backdoor attack scenarios (ViralApp, Fever, Referral) that exploit domain-specific vocabulary to evade conventional detection
SCOUT defense framework using token-level saliency maps — measuring per-token impact on output logits — to detect both conspicuous and semantically coherent backdoor triggers
Evaluation on SST-2, IMDB, and AG News against four established attacks and three novel attacks, demonstrating robustness while preserving clean accuracy

🛡️ Threat Analysis

Data Poisoning Attack

The backdoor injection mechanism is explicitly data poisoning of the fine-tuning dataset; SCOUT's defense filters poisoned training samples, and the paper frames itself as a 'defense against data poisoning attacks' throughout.

Model Poisoning

Primary focus is on backdoor attacks with hidden trigger-based targeted behavior (ViralApp, Fever, Referral attacks; BadNet, AddSent, SynBkd, StyleBkd) and SCOUT as a trigger-detection defense via saliency analysis — classic backdoor/trojan threat model.

Details

Domains

nlp

Model Types

transformerllm

Threat Tags

training_timetargetedgrey_box

Datasets

SST-2IMDBAG News

Applications

text classificationmedical diagnosisclinical nlpsocial media addiction classification

Read PDF arXiv DOI Code

SCOUT: A Defense Against Data Poisoning Attacks in Fine-Tuned Language Models

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

PoTS: Proof-of-Training-Steps for Backdoor Detection in Large Language Models

Your RAG is Unfair: Exposing Fairness Vulnerabilities in Retrieval-Augmented Generation via Backdoor Attacks

Phantom Transfer: Data-level Defences are Insufficient Against Data Poisoning

Towards Safeguarding LLM Fine-tuning APIs against Cipher Attacks

SemanticShield: LLM-Powered Audits Expose Shilling Attacks in Recommender Systems

ConfGuard: A Simple and Effective Backdoor Detection for Large Language Models

Backdoor Token Unlearning: Exposing and Defending Backdoors in Pretrained Language Models

Merging Triggers, Breaking Backdoors: Defensive Poisoning for Instruction-Tuned Language Models