Adversarial Samples Are Not Created Equal

Jennifer Crawford 1, Amol Khanna 2, Fred Lu 3, Amy R. Wagoner 4, Stella Biderman 4, Andre T. Nguyen 4, Edward Raff 2

0 citations · 31 references · arXiv

Published on arXiv

2601.00577

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Adversarially trained models display striking resilience to adversarial bugs but break down once the perturbation is strong enough to manipulate non-robust predictive features; SAM specifically protects against adversarial bugs rather than promoting robust feature learning

Non-Robust Feature Manipulation Metric

Novel technique introduced


Over the past decade, numerous theories have been proposed to explain the widespread vulnerability of deep neural networks to adversarial evasion attacks. Among these, the theory of non-robust features proposed by Ilyas et al. has been widely accepted, showing that brittle but predictive features of the data distribution can be directly exploited by attackers. However, this theory overlooks adversarial samples that do not directly utilize these features. In this work, we advocate that these two kinds of samples - those that use brittle but predictive features and those that do not - comprise two types of adversarial weaknesses and should be differentiated when evaluating adversarial robustness. For this purpose, we propose an ensemble-based metric to measure the manipulation of non-robust features by adversarial perturbations and use this metric to analyze the makeup of adversarial samples generated by attackers. This new perspective also allows us to re-examine multiple phenomena, including the impact of sharpness-aware minimization on adversarial robustness and the robustness gap observed between adversarial training and standard training on robust datasets.


Key Contributions

  • Ensemble-based metric to measure whether adversarial perturbations exploit non-robust features, enabling classification of adversarial samples into two distinct types
  • Empirical finding that adversarially trained models resist 'adversarial bugs' (non-feature-exploiting attacks) but fail when perturbations are large enough to manipulate predictive features
  • Discovery that Sharpness-Aware Minimization (SAM) provides targeted protection against adversarial bugs, and that robust datasets still contain non-robust features explaining the AT vs. robust dataset robustness gap
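The contributions above hinge on distinguishing perturbations that transfer (they manipulate predictive features shared across models) from model-specific "adversarial bugs." A minimal numpy sketch of one plausible proxy for such an ensemble-based metric, assuming a toy ensemble of linear classifiers (the paper's exact metric and `ensemble_flip_rate` name are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def ensemble_flip_rate(ensemble, x, x_adv):
    """Fraction of ensemble members whose prediction flips under the
    perturbation x_adv - x. A high rate suggests the perturbation manipulates
    non-robust but predictive features shared across models; a low rate
    suggests a model-specific 'adversarial bug'. (Hypothetical proxy.)"""
    flips = [int(np.argmax(W @ x) != np.argmax(W @ x_adv)) for W in ensemble]
    return sum(flips) / len(flips)

# Toy 2-class linear ensemble over 4-dimensional inputs
rng = np.random.default_rng(0)
ensemble = [rng.normal(size=(2, 4)) for _ in range(5)]
x = np.array([1.0, 0.5, -0.3, 0.2])
x_adv = x + 0.1 * rng.normal(size=4)   # stand-in adversarial perturbation
rate = ensemble_flip_rate(ensemble, x, x_adv)
print(rate)  # value in [0, 1]
```

In practice the ensemble would consist of independently (standardly) trained deep networks rather than random linear models; the point of the sketch is only the flip-rate computation.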

🛡️ Threat Analysis

Input Manipulation Attack

Directly analyzes adversarial evasion attacks, proposing a metric to classify adversarial perturbations by whether they exploit non-robust features, and re-examines adversarial training and robustness phenomena.
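For concreteness, a minimal numpy sketch of a one-step gradient-sign evasion attack (FGSM-style) on a toy binary logistic model; the function name, weights, and inputs are illustrative assumptions, not the attacks evaluated in the paper:

```python
import numpy as np

def fgsm_linear(w, b, x, y, eps):
    """One-step gradient-sign evasion on a logistic model p = sigmoid(w.x + b).
    For cross-entropy loss with label y in {0, 1}, d(loss)/dx = (p - y) * w,
    so we move x by eps in the sign direction of that gradient."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    return x + eps * np.sign((p - y) * w)

w = np.array([2.0, -1.0, 0.5])
b = 0.0
x = np.array([0.2, 0.1, -0.1])        # clean input: score w.x = 0.25 > 0
x_adv = fgsm_linear(w, b, x, y=1, eps=0.3)
print(w @ x, w @ x_adv)               # the perturbed score crosses the boundary
```

With these toy numbers the clean score is positive (class 1) while the perturbed score is negative, i.e. the bounded input manipulation flips the prediction.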


Details

Domains
vision
Model Types
cnn
Threat Tags
inference_time, digital, white_box
Datasets
CIFAR-10, SVHN
Applications
image classification