Adversarial Bias: Data Poisoning Attacks on Fairness
Published on arXiv
arXiv:2511.08331
Data Poisoning Attack
OWASP ML Top 10 — ML02
Key Finding
The proposed poisoning attack significantly outperforms existing methods in degrading fairness metrics across multiple models and datasets, often achieving substantially higher unfairness with comparable or only slightly worse accuracy impact.
Adversarial Bias
Novel technique introduced
With the growing adoption of AI and machine learning systems in real-world applications, ensuring their fairness has become increasingly critical. The majority of the work in algorithmic fairness focuses on assessing and improving the fairness of machine learning systems. There is relatively little research on fairness vulnerability, i.e., how an AI system's fairness can be intentionally compromised. In this work, we first provide a theoretical analysis demonstrating that a simple adversarial poisoning strategy is sufficient to induce maximally unfair behavior in naive Bayes classifiers. Our key idea is to strategically inject a small fraction of carefully crafted adversarial data points into the training set, biasing the model's decision boundary to disproportionately affect a protected group while preserving generalizable performance. To illustrate the practical effectiveness of our method, we conduct experiments across several benchmark datasets and models. We find that our attack significantly outperforms existing methods in degrading fairness metrics across multiple models and datasets, often achieving substantially higher levels of unfairness with a comparable or only slightly worse impact on accuracy. Notably, our method proves effective on a wide range of models, in contrast to prior work, demonstrating a robust and potent approach to compromising the fairness of machine learning systems.
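To make the abstract's key idea concrete, here is a minimal toy sketch, not the authors' attack: a hand-rolled Gaussian naive Bayes classifier is trained on synthetic data where the protected attribute `a` is independent of the label, then a fraction of crafted points (all with `a = 1` and a negative label) is appended to the training set. The poison fraction and data-generating details are our own assumptions for illustration; the effect is measured with the demographic-parity gap before and after poisoning.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic clean data: the label depends on x only; the protected
# attribute a is independent of x and y, so a clean model should show
# a near-zero demographic-parity gap.
n = 2000
x = rng.normal(0.0, 1.0, n)
a = rng.integers(0, 2, n)                         # protected attribute
y = (x + rng.normal(0.0, 0.5, n) > 0).astype(int)
X = np.column_stack([x, a.astype(float)])

def fit_gaussian_nb(X, y):
    """Per-class feature means, variances, and class priors."""
    params = {}
    for c in (0, 1):
        Xc = X[y == c]
        params[c] = (Xc.mean(0), Xc.var(0) + 1e-9, len(Xc) / len(X))
    return params

def predict(params, X):
    def score(c):
        mu, var, prior = params[c]
        ll = -0.5 * ((X - mu) ** 2 / var + np.log(2 * np.pi * var))
        return ll.sum(1) + np.log(prior)
    return (score(1) > score(0)).astype(int)

def dp_gap(pred, a):
    """Demographic-parity gap |P(yhat=1 | a=1) - P(yhat=1 | a=0)|."""
    return abs(pred[a == 1].mean() - pred[a == 0].mean())

clean_gap = dp_gap(predict(fit_gaussian_nb(X, y), X), a)

# Poisoning sketch: append crafted points with a = 1 and y = 0 whose x
# values mimic the clean negative class, so mainly the model's beliefs
# about the protected attribute shift. The 25% fraction is exaggerated
# for a visible toy effect; the paper's attack is far more economical.
m = n // 4
x_p = rng.normal(-0.7, 0.8, m)
X_pois = np.vstack([X, np.column_stack([x_p, np.ones(m)])])
y_pois = np.concatenate([y, np.zeros(m, dtype=int)])

# Fairness is evaluated on the clean population, not the poisoned set.
poisoned_gap = dp_gap(predict(fit_gaussian_nb(X_pois, y_pois), X), a)

print(f"clean gap:    {clean_gap:.3f}")
print(f"poisoned gap: {poisoned_gap:.3f}")
```

On this toy setup the poisoned model predicts the positive class markedly less often for the `a = 1` group, widening the gap while the decision rule on `x` is largely preserved, which mirrors the paper's claim of high unfairness at modest accuracy cost.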
Key Contributions
- Theoretical analysis proving a simple adversarial poisoning strategy is sufficient to induce maximally unfair behavior in naive Bayes classifiers
- Novel data poisoning attack that injects a small fraction of crafted samples to disproportionately harm a protected group while preserving overall accuracy
- Empirical demonstration that the attack outperforms existing fairness-attack methods across multiple models and benchmark datasets
🛡️ Threat Analysis
The core contribution is injecting adversarial data points into the training set to corrupt model behavior — specifically, biasing the decision boundary against protected groups. This is training-time data poisoning without a hidden trigger, so ML02 (Data Poisoning Attack) applies and ML10 (Model Poisoning) does not.