Adversarially-Aware Architecture Design for Robust Medical AI Systems
Alyssa Gerhart, Balaji Iyangar
Published on arXiv (2510.23622)
- Input Manipulation Attack (OWASP ML Top 10: ML01)
- Data Poisoning Attack (OWASP ML Top 10: ML02)
Key Finding
Adversarial attacks (FGSM, PGD, Nightshade) significantly degrade classification accuracy on dermatological images, while adversarial training and distillation partially recover robustness at the cost of clean-data performance.
Adversarial attacks pose a severe risk to AI systems used in healthcare, capable of misleading models into dangerous misclassifications that can delay treatment or cause misdiagnoses. These attacks, often imperceptible to humans, threaten patient safety, particularly in underserved populations. Our study explores these vulnerabilities through empirical experimentation on a dermatological dataset, where adversarial methods significantly reduce classification accuracy. Through detailed threat modeling, experimental benchmarking, and model evaluation, we demonstrate both the severity of the threat and the partial success of defenses such as adversarial training and distillation. Our results show that while defenses reduce attack success rates, they must be balanced against model performance on clean data. We conclude with a call for integrated technical, ethical, and policy-based approaches to build more resilient, equitable AI in healthcare.
Key Contributions
- Empirical evaluation of adversarial evasion attacks (FGSM, PGD) and data poisoning (Nightshade) on a dermatological skin cancer dataset (ISIC)
- Benchmarking of defenses—adversarial training, defensive distillation, and hybrid methods—against both attack types in a medical imaging context
- Characterization of the robustness-accuracy trade-off and a call for integrated technical, ethical, and policy-based approaches for medical AI resilience
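The robustness-accuracy trade-off characterized above arises because adversarial training mixes perturbed examples into the training objective. As a hedged illustration only (a toy NumPy logistic-regression model on synthetic data, not the paper's dermatology classifiers; the `mix` parameter and helper names are assumptions for this sketch), the following shows the basic loop:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(X, y, w, b, eps):
    # One-step input-gradient-sign perturbation for logistic regression:
    # dL/dx of binary cross-entropy is (p - y) * w.
    p = sigmoid(X @ w + b)
    return X + eps * np.sign((p - y)[:, None] * w)

def adversarial_train(X, y, eps=0.5, lr=0.5, epochs=200, mix=0.5):
    """Train on a weighted mix of clean and FGSM-perturbed examples.

    mix controls the robustness-accuracy trade-off: 0.0 is standard
    training on clean data only, 1.0 trains on adversarial examples only.
    """
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1]) * 0.01
    b = 0.0
    for _ in range(epochs):
        X_adv = fgsm(X, y, w, b, eps)          # attack the current model
        X_batch = np.vstack([X, X_adv])
        y_batch = np.concatenate([y, y])
        weight = np.concatenate([np.full(len(y), 1 - mix),
                                 np.full(len(y), mix)])
        p = sigmoid(X_batch @ w + b)
        err = (p - y_batch) * weight           # weighted BCE gradient term
        w -= lr * X_batch.T @ err / len(y_batch)
        b -= lr * err.sum() / len(y_batch)
    return w, b

# Synthetic two-class data standing in for image features.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.5, 0.5, (50, 2)), rng.normal(1.5, 0.5, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

w, b = adversarial_train(X, y)
clean_acc = np.mean((sigmoid(X @ w + b) > 0.5) == y)
robust_acc = np.mean((sigmoid(fgsm(X, y, w, b, 0.5) @ w + b) > 0.5) == y)
print(clean_acc, robust_acc)
```

Sweeping `mix` between 0 and 1 on real data would trace out the trade-off curve the paper describes: higher `mix` typically raises robust accuracy while lowering clean accuracy.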
🛡️ Threat Analysis
The central empirical focus is on FGSM and PGD evasion attacks at inference time against dermatological image classifiers, together with an evaluation of adversarial training and defensive distillation as defenses.
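FGSM takes a single step of size ε in the direction of the sign of the input gradient, and PGD iterates smaller steps while projecting back into the ε-ball around the original input. The sketch below illustrates both on a toy NumPy logistic-regression classifier; it is a minimal stand-in, not the paper's dermatology models, and the toy weights are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    """FGSM: x_adv = x + eps * sign(dL/dx).

    For binary cross-entropy with p = sigmoid(w.x + b), dL/dx = (p - y) * w.
    """
    p = sigmoid(w @ x + b)
    return x + eps * np.sign((p - y) * w)

def pgd(x, y, w, b, eps, alpha, steps):
    """PGD: iterated FGSM steps of size alpha, projected onto the
    L-infinity eps-ball around the original input."""
    x_adv = x.copy()
    for _ in range(steps):
        p = sigmoid(w @ x_adv + b)
        x_adv = x_adv + alpha * np.sign((p - y) * w)
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project back
    return x_adv

# Toy classifier: predicts class 1 when x[0] + x[1] > 0.
w, b = np.array([1.0, 1.0]), 0.0
x, y = np.array([0.6, 0.6]), 1.0   # clean input, correctly classified

x_fgsm = fgsm(x, y, w, b, eps=0.7)
x_pgd = pgd(x, y, w, b, eps=0.7, alpha=0.3, steps=5)

print(sigmoid(w @ x + b) > 0.5)       # True  (clean prediction correct)
print(sigmoid(w @ x_fgsm + b) > 0.5)  # False (FGSM flips the label)
print(sigmoid(w @ x_pgd + b) > 0.5)   # False (PGD flips it too)
```

On real images the same construction applies pixel-wise, with ε small enough that the perturbation is imperceptible to humans.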
The study explicitly experiments with Nightshade, injecting poisoned training samples into the ISIC dataset to corrupt the training pipeline and induce misclassification: a core data poisoning scenario.
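Nightshade itself crafts optimized, visually subtle poison images, and its full procedure is beyond a short sketch. As a much simpler stand-in for the same threat class (tainting the training set so the model learns wrong associations), the following flips a fraction of training labels toward a target class; `poison_labels` and its parameters are hypothetical helpers for this illustration, not the paper's method:

```python
import numpy as np

def poison_labels(y, rate, target_class=1, seed=0):
    """Flip a fraction `rate` of non-target labels to `target_class`.

    Plain label flipping, far cruder than Nightshade's optimized image
    perturbations, but it corrupts the pipeline in the same spirit:
    the model trains on tainted supervision.
    """
    rng = np.random.default_rng(seed)
    y_poisoned = y.copy()
    victims = np.flatnonzero(y != target_class)      # candidate labels to flip
    n_flip = int(rate * len(victims))
    flip_idx = rng.choice(victims, size=n_flip, replace=False)
    y_poisoned[flip_idx] = target_class
    return y_poisoned

# 50 benign (0) and 50 malignant (1) labels; poison 30% of the benign ones.
y = np.concatenate([np.zeros(50, dtype=int), np.ones(50, dtype=int)])
y_poisoned = poison_labels(y, rate=0.3)
print((y_poisoned != y).sum())  # 15 labels flipped
```

Training any classifier on `y_poisoned` instead of `y` then degrades accuracy on the victim class, which is the failure mode the poisoning experiments measure.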