Adversarial Bias: Data Poisoning Attacks on Fairness
Published on arXiv
arXiv:2511.08331
Data Poisoning Attack
OWASP ML Top 10 — ML02
Key Finding
The proposed poisoning attack significantly outperforms existing methods in degrading fairness metrics across multiple models and datasets, often achieving substantially higher unfairness with comparable or only slightly worse accuracy impact.
Adversarial Bias
Novel technique introduced
With the growing adoption of AI and machine learning systems in real-world applications, ensuring their fairness has become increasingly critical. The majority of the work in algorithmic fairness focuses on assessing and improving the fairness of machine learning systems. There is relatively little research on fairness vulnerability, i.e., how an AI system's fairness can be intentionally compromised. In this work, we first provide a theoretical analysis demonstrating that a simple adversarial poisoning strategy is sufficient to induce maximally unfair behavior in naive Bayes classifiers. Our key idea is to strategically inject a small fraction of carefully crafted adversarial data points into the training set, biasing the model's decision boundary to disproportionately affect a protected group while preserving generalizable performance. To illustrate the practical effectiveness of our method, we conduct experiments across several benchmark datasets and models. We find that our attack significantly outperforms existing methods in degrading fairness metrics across multiple models and datasets, often achieving substantially higher levels of unfairness with a comparable or only slightly worse impact on accuracy. Notably, our method proves effective on a wide range of models, in contrast to prior work, demonstrating a robust and potent approach to compromising the fairness of machine learning systems.
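To make the abstract's key idea concrete, here is a minimal toy sketch, not the authors' attack: a hand-rolled Gaussian naive Bayes classifier is trained on synthetic data where the protected attribute `a` is independent of the label, then a fraction of crafted points (all with `a = 1` and a negative label) is appended to the training set. The poison fraction and data-generating details are our own assumptions for illustration; the effect is measured with the demographic-parity gap before and after poisoning.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic clean data: the label depends on x only; the protected
# attribute a is independent of x and y, so a clean model should show
# a near-zero demographic-parity gap.
n = 2000
x = rng.normal(0.0, 1.0, n)
a = rng.integers(0, 2, n)                         # protected attribute
y = (x + rng.normal(0.0, 0.5, n) > 0).astype(int)
X = np.column_stack([x, a.astype(float)])

def fit_gaussian_nb(X, y):
    """Per-class feature means, variances, and class priors."""
    params = {}
    for c in (0, 1):
        Xc = X[y == c]
        params[c] = (Xc.mean(0), Xc.var(0) + 1e-9, len(Xc) / len(X))
    return params

def predict(params, X):
    def score(c):
        mu, var, prior = params[c]
        ll = -0.5 * ((X - mu) ** 2 / var + np.log(2 * np.pi * var))
        return ll.sum(1) + np.log(prior)
    return (score(1) > score(0)).astype(int)

def dp_gap(pred, a):
    """Demographic-parity gap |P(yhat=1 | a=1) - P(yhat=1 | a=0)|."""
    return abs(pred[a == 1].mean() - pred[a == 0].mean())

clean_gap = dp_gap(predict(fit_gaussian_nb(X, y), X), a)

# Poisoning sketch: append crafted points with a = 1 and y = 0 whose x
# values mimic the clean negative class, so mainly the model's beliefs
# about the protected attribute shift. The 25% fraction is exaggerated
# for a visible toy effect; the paper's attack is far more economical.
m = n // 4
x_p = rng.normal(-0.7, 0.8, m)
X_pois = np.vstack([X, np.column_stack([x_p, np.ones(m)])])
y_pois = np.concatenate([y, np.zeros(m, dtype=int)])

# Fairness is evaluated on the clean population, not the poisoned set.
poisoned_gap = dp_gap(predict(fit_gaussian_nb(X_pois, y_pois), X), a)

print(f"clean gap:    {clean_gap:.3f}")
print(f"poisoned gap: {poisoned_gap:.3f}")
```

On this toy setup the poisoned model predicts the positive class markedly less often for the `a = 1` group, widening the gap while the decision rule on `x` is largely preserved, which mirrors the paper's claim of high unfairness at modest accuracy cost.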
Key Contributions
- Theoretical analysis proving a simple adversarial poisoning strategy is sufficient to induce maximally unfair behavior in naive Bayes classifiers
- Novel data poisoning attack that injects a small fraction of crafted samples to disproportionately harm a protected group while preserving overall accuracy
- Empirical demonstration that the attack outperforms existing fairness-attack methods across multiple models and benchmark datasets
🛡️ Threat Analysis
The core contribution is injecting adversarial data points into the training set to corrupt model behavior — specifically, biasing the decision boundary against protected groups. This is training-time data poisoning without a hidden trigger, so ML02 (Data Poisoning Attack) applies and ML10 (Model Poisoning) does not.