How Worst-Case Are Adversarial Attacks? Linking Adversarial and Perturbation Robustness
Published on arXiv
2601.14519
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
Systematic benchmarking reveals the conditions under which adversarial attack success meaningfully reflects robustness to random perturbations, and when it instead reflects atypical worst-case events that are unlikely under stochastic noise.
Adversarial attacks are widely used to identify model vulnerabilities; however, their validity as proxies for robustness to random perturbations remains debated. We ask whether an adversarial example provides a representative estimate of misprediction risk under stochastic perturbations of the same magnitude, or instead reflects an atypical worst-case event. To address this question, we introduce a probabilistic analysis that quantifies this risk with respect to directionally biased perturbation distributions, parameterized by a concentration factor $\kappa$ that interpolates between isotropic noise and adversarial directions. Building on this analysis, we probe the limits of the connection by proposing an attack strategy designed to expose vulnerabilities in regimes that are statistically closer to uniform noise. Experiments on ImageNet and CIFAR-10 systematically benchmark multiple attacks, revealing when adversarial success meaningfully reflects robustness to perturbations and when it does not, thereby informing their use in safety-oriented robustness evaluation.
Key Contributions
- Probabilistic analysis framework parameterized by concentration factor κ that interpolates between isotropic noise and adversarial directions to quantify misprediction risk
- Novel attack strategy designed to probe vulnerabilities in regimes statistically closer to uniform noise, revealing structural limits of adversarial proxies
- Systematic benchmarking of multiple adversarial attacks on ImageNet and CIFAR-10 that characterizes when adversarial success reliably predicts random-perturbation robustness
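The paper's exact formulation of the κ-parameterized perturbation family is not reproduced here, but the core idea can be illustrated with a minimal sketch: mix a unit isotropic noise direction with a unit adversarial direction using weight κ, renormalize to a fixed magnitude ε, and Monte Carlo estimate the misprediction rate. The toy linear classifier, the specific mixing rule, and all parameter values below are illustrative assumptions, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)

def biased_perturbation(adv_dir, kappa, eps, rng):
    """Sample a perturbation of magnitude eps whose direction interpolates
    between isotropic noise (kappa=0) and the adversarial direction (kappa=1).
    The linear-mix-then-renormalize rule is an illustrative choice."""
    noise = rng.standard_normal(adv_dir.shape)
    noise /= np.linalg.norm(noise)
    direction = (1 - kappa) * noise + kappa * adv_dir / np.linalg.norm(adv_dir)
    return eps * direction / np.linalg.norm(direction)

# Toy linear classifier: predicts sign(w @ x); x sits near the boundary.
w = np.array([1.0, -0.5, 0.25])
x = np.array([0.1, 0.0, 0.0])   # w @ x = 0.1 > 0, so clean label is +1
adv_dir = -w                    # steepest direction toward misclassification
eps = 0.2

def misprediction_rate(kappa, n=10_000):
    """Monte Carlo estimate of P(misprediction) under kappa-biased noise."""
    flips = 0
    for _ in range(n):
        delta = biased_perturbation(adv_dir, kappa, eps, rng)
        flips += (w @ (x + delta)) < 0
    return flips / n

for kappa in (0.0, 0.5, 1.0):
    print(f"kappa={kappa:.1f}  misprediction rate ~ {misprediction_rate(kappa):.3f}")
```

In this toy setting the rate rises with κ: the fully adversarial direction (κ=1) always crosses the boundary at this ε, while isotropic noise (κ=0) flips the prediction only occasionally, which is exactly the gap between worst-case and average-case risk that the paper interrogates.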
🛡️ Threat Analysis
The paper analyzes adversarial examples as worst-case inference-time attacks and proposes an attack strategy that probes model vulnerabilities in regimes statistically closer to uniform noise, placing it squarely within the adversarial examples / input manipulation threat category.