Adversarial Samples Are Not Created Equal
Jennifer Crawford 1, Amol Khanna 2, Fred Lu 3, Amy R. Wagoner 4, Stella Biderman 4, Andre T. Nguyen 4, Edward Raff 2
Published on arXiv
2601.00577
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
Adversarially trained models show striking resilience to adversarial bugs but break down once perturbations are strong enough to manipulate non-robust predictive features; SAM specifically protects against adversarial bugs rather than promoting robust feature learning
Non-Robust Feature Manipulation Metric
Novel technique introduced
Over the past decade, numerous theories have been proposed to explain the widespread vulnerability of deep neural networks to adversarial evasion attacks. Among these, the theory of non-robust features proposed by Ilyas et al. has been widely accepted, showing that brittle but predictive features of the data distribution can be directly exploited by attackers. However, this theory overlooks adversarial samples that do not directly utilize these features. In this work, we advocate that these two kinds of samples - those which use brittle but predictive features and those that do not - comprise two types of adversarial weaknesses and should be differentiated when evaluating adversarial robustness. For this purpose, we propose an ensemble-based metric to measure the manipulation of non-robust features by adversarial perturbations and use this metric to analyze the makeup of adversarial samples generated by attackers. This new perspective also allows us to re-examine multiple phenomena, including the impact of sharpness-aware minimization on adversarial robustness and the robustness gap observed between adversarial training and standard training on robust datasets.
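To make the ensemble-based idea concrete, here is a minimal sketch of one plausible instantiation, not the paper's exact metric: if an adversarial sample fools many independently trained models, the perturbation likely manipulates predictive (non-robust) features shared across models; if it fools only one model, it is more likely a model-specific "adversarial bug". The function name and array layout are illustrative assumptions.

```python
import numpy as np

def nonrobust_manipulation_score(clean_preds, adv_preds, labels):
    """Hypothetical per-sample score: fraction of ensemble members fooled.

    clean_preds, adv_preds: (n_models, n_samples) arrays of predicted labels
    from independently trained models, on clean and adversarial inputs.
    labels: (n_samples,) ground-truth labels.

    A flip that transfers across many models suggests the perturbation
    exploits shared non-robust predictive features; an isolated flip
    suggests a model-specific adversarial bug.
    """
    clean_correct = clean_preds == labels      # models correct on clean input
    adv_wrong = adv_preds != labels            # models fooled on adv input
    fooled = clean_correct & adv_wrong         # correct -> fooled transitions
    denom = clean_correct.sum(axis=0)          # only count initially-correct models
    return np.where(denom > 0,
                    fooled.sum(axis=0) / np.maximum(denom, 1),
                    0.0)
```

A score near 1.0 would flag a sample as manipulating non-robust features, while a score near 0 (on a model that was still fooled) would flag an adversarial bug.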
Key Contributions
- Ensemble-based metric to measure whether adversarial perturbations exploit non-robust features, enabling classification of adversarial samples into two distinct types
- Empirical finding that adversarially trained models resist 'adversarial bugs' (non-feature-exploiting attacks) but fail when perturbations are large enough to manipulate predictive features
- Discovery that Sharpness-Aware Minimization (SAM) provides targeted protection against adversarial bugs, and that robust datasets still contain non-robust features explaining the AT vs. robust dataset robustness gap
🛡️ Threat Analysis
Directly analyzes adversarial evasion attacks, proposing a metric to classify adversarial perturbations by whether they exploit non-robust features, and re-examines adversarial training and robustness phenomena.