Adversarial Samples Are Not Created Equal
Jennifer Crawford 1, Amol Khanna 2, Fred Lu 3, Amy R. Wagoner 4, Stella Biderman 4, Andre T. Nguyen 4, Edward Raff 2
Published on arXiv
2601.00577
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
Adversarially trained models show striking resilience to adversarial bugs but break down once perturbations are strong enough to manipulate non-robust predictive features; SAM specifically protects against adversarial bugs rather than promoting robust feature learning
Non-Robust Feature Manipulation Metric
Novel technique introduced
Over the past decade, numerous theories have been proposed to explain the widespread vulnerability of deep neural networks to adversarial evasion attacks. Among these, the theory of non-robust features proposed by Ilyas et al. has been widely accepted, showing that brittle but predictive features of the data distribution can be directly exploited by attackers. However, this theory overlooks adversarial samples that do not directly utilize these features. In this work, we advocate that these two kinds of samples - those which use brittle but predictive features and those that do not - comprise two types of adversarial weaknesses and should be differentiated when evaluating adversarial robustness. For this purpose, we propose an ensemble-based metric to measure the manipulation of non-robust features by adversarial perturbations and use this metric to analyze the makeup of adversarial samples generated by attackers. This new perspective also allows us to re-examine multiple phenomena, including the impact of sharpness-aware minimization on adversarial robustness and the robustness gap observed between adversarial training and standard training on robust datasets.
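To make the ensemble-based idea concrete, here is a minimal sketch of one plausible instantiation, not the paper's exact metric: if an adversarial sample fools many independently trained models, the perturbation likely manipulates predictive (non-robust) features shared across models; if it fools only one model, it is more likely a model-specific "adversarial bug". The function name and array layout are illustrative assumptions.

```python
import numpy as np

def nonrobust_manipulation_score(clean_preds, adv_preds, labels):
    """Hypothetical per-sample score: fraction of ensemble members fooled.

    clean_preds, adv_preds: (n_models, n_samples) arrays of predicted labels
    from independently trained models, on clean and adversarial inputs.
    labels: (n_samples,) ground-truth labels.

    A flip that transfers across many models suggests the perturbation
    exploits shared non-robust predictive features; an isolated flip
    suggests a model-specific adversarial bug.
    """
    clean_correct = clean_preds == labels      # models correct on clean input
    adv_wrong = adv_preds != labels            # models fooled on adv input
    fooled = clean_correct & adv_wrong         # correct -> fooled transitions
    denom = clean_correct.sum(axis=0)          # only count initially-correct models
    return np.where(denom > 0,
                    fooled.sum(axis=0) / np.maximum(denom, 1),
                    0.0)
```

A score near 1.0 would flag a sample as manipulating non-robust features, while a score near 0 (on a model that was still fooled) would flag an adversarial bug.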
Key Contributions
- Ensemble-based metric to measure whether adversarial perturbations exploit non-robust features, enabling classification of adversarial samples into two distinct types
- Empirical finding that adversarially trained models resist 'adversarial bugs' (non-feature-exploiting attacks) but fail when perturbations are large enough to manipulate predictive features
- Discovery that Sharpness-Aware Minimization (SAM) provides targeted protection against adversarial bugs, and that robust datasets still contain non-robust features explaining the AT vs. robust dataset robustness gap
🛡️ Threat Analysis
Directly analyzes adversarial evasion attacks, proposing a metric to classify adversarial perturbations by whether they exploit non-robust features, and re-examines adversarial training and robustness phenomena.