From Theory to Practice: Evaluating Data Poisoning Attacks and Defenses in In-Context Learning on Social Media Health Discourse
Rabeya Amin Jhuma, Mostafa Mohaimen Akand Faisal
Published on arXiv
arXiv:2510.03636
Data Poisoning Attack
OWASP ML Top 10 — ML02
Training Data Poisoning
OWASP LLM Top 10 — LLM03
Key Finding
Synonym replacement and negation insertion flipped sentiment labels in up to 67% of ICL predictions; Spectral Signature Defense restored logistic regression validation accuracy to 100% while maintaining ICL accuracy at ~46.7%.
Spectral Signature Defense
Novel technique introduced
This study explores how in-context learning (ICL) in large language models can be disrupted by data poisoning attacks in the setting of public health sentiment analysis. Using tweets about Human Metapneumovirus (HMPV), small adversarial perturbations (synonym replacement, negation insertion, and randomized perturbation) were introduced into the support examples. Even these minor manipulations caused major disruptions, flipping sentiment labels in up to 67% of cases. To counter this, a Spectral Signature Defense was applied, filtering out poisoned examples while keeping the data's meaning and sentiment intact. After the defense, ICL accuracy held steady at around 46.7%, and logistic regression validation reached 100% accuracy, showing that the defense preserved the dataset's integrity. Overall, the findings extend prior theoretical studies of ICL poisoning to a practical, high-stakes setting in public health discourse analysis, highlighting both the fragility of ICL under attack and the value of spectral defenses in making AI systems more reliable for health-related social media monitoring.
Key Contributions
- Demonstrates that minor adversarial perturbations (synonym replacement, negation insertion, randomized perturbation) to ICL support examples cause up to 67% sentiment label flips in a public health tweet setting
- Applies and evaluates Spectral Signature Defense as a sanitization method for poisoned ICL demonstrations, preserving dataset integrity
- Extends ICL poisoning research from controlled benchmarks to a real-world, noisy public health social media domain (HMPV tweets)
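The two text-level perturbations named above can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: the synonym table and whitespace tokenization are toy assumptions, and a real attack would likely use richer NLP tooling.

```python
# Toy sketch of two ICL-poisoning perturbations from the paper's threat model.
# The synonym dictionary and split-on-whitespace tokenizer are illustrative
# assumptions, not the authors' actual implementation.

def synonym_replace(text: str, synonyms: dict) -> str:
    """Swap each word for an attacker-chosen synonym, preserving word order."""
    return " ".join(synonyms.get(w.lower(), w) for w in text.split())

def negation_insert(text: str, aux_verbs=("is", "are", "was", "were")) -> str:
    """Insert 'not' after the first auxiliary verb to flip sentiment polarity."""
    words = text.split()
    for i, w in enumerate(words):
        if w.lower() in aux_verbs:
            return " ".join(words[: i + 1] + ["not"] + words[i + 1 :])
    return text  # no auxiliary verb found; leave the example unchanged
```

Applied to a support example such as "this vaccine is effective", negation insertion yields "this vaccine is not effective", a minimal edit that can nonetheless flip the demonstrated label.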
🛡️ Threat Analysis
The core attack poisons in-context learning support examples (the data that conditions the model) via synonym replacement, negation insertion, and randomized perturbation, flipping sentiment labels in up to 67% of cases; Spectral Signature Defense is evaluated as a data-sanitization countermeasure.
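The defense side can be sketched as well. The sketch below follows the standard spectral-signature recipe (score each example by its squared projection onto the top singular direction of the mean-centered embeddings, then drop the highest-scoring fraction); the embedding source and the removal fraction are assumptions, not the paper's exact configuration.

```python
# Hedged sketch of a Spectral Signature Defense over embeddings of ICL
# support examples. Assumes embeddings are already computed (e.g. by some
# sentence encoder); remove_frac is an illustrative hyperparameter.
import numpy as np

def spectral_filter(embeddings: np.ndarray, remove_frac: float = 0.1) -> np.ndarray:
    """Return sorted indices of examples KEPT after removing the fraction
    with the largest spectral-signature scores."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Top right singular vector = direction of greatest variance; poisoned
    # examples tend to have outlying projections along it.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = (centered @ vt[0]) ** 2
    n_remove = int(len(scores) * remove_frac)
    if n_remove == 0:
        return np.arange(len(scores))
    kept = np.argsort(scores)[:-n_remove]  # drop the n_remove largest scores
    return np.sort(kept)
```

In a setup like the paper's, the filtered support set would then be re-used for ICL prompting and for fitting the logistic regression validator.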