On the Effects of Adversarial Perturbations on Distribution Robustness
Yipei Wang, Zhaoying Pan, Xiaoqian Wang
Published on arXiv
2601.16464
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
ℓ∞ adversarial perturbations on data with moderate bias can increase distribution robustness, and this gain persists on highly skewed data when simplicity bias induces reliance on core features with greater separability.
Adversarial robustness refers to a model's ability to resist perturbations of its inputs, while distribution robustness evaluates the model's performance under data shifts. Although both aim to ensure reliable performance, prior work has revealed a tradeoff between distribution and adversarial robustness. Specifically, adversarial training may increase reliance on spurious features, which can harm distribution robustness, especially the performance on underrepresented subgroups. We present a theoretical analysis of adversarial and distribution robustness that provides a tractable surrogate for per-step adversarial training by studying models trained on perturbed data. Beyond the tradeoff, our work identifies a nuanced phenomenon: ℓ∞ perturbations on data with moderate bias can yield an increase in distribution robustness. Moreover, this gain in distribution robustness persists on highly skewed data when simplicity bias induces reliance on the core feature, characterized by greater feature separability. Our theoretical analysis extends the understanding of the tradeoff by highlighting its interplay with feature separability. Although the tradeoff persists in many cases, overlooking the role of feature separability may lead to misleading conclusions about robustness.
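The surrogate the abstract describes — replacing per-step adversarial training with training on data perturbed once by the worst-case ℓ∞ attack — can be sketched for a linear model. This is a minimal illustration, not the paper's exact setup: the two-feature "core vs. spurious" construction, the logistic-regression learner, and all parameter values are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a well-separated "core" feature and a weakly separated
# "spurious" one, standing in for the paper's feature-separability
# setting (our assumption, not the paper's construction).
n = 1000
y = rng.choice([-1.0, 1.0], size=n)
core = y * 2.0 + rng.normal(scale=1.0, size=n)      # high separability
spurious = y * 0.5 + rng.normal(scale=1.0, size=n)  # low separability
X = np.stack([core, spurious], axis=1)

def train_logreg(X, y, lr=0.1, steps=500):
    """Plain gradient-descent logistic regression with labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = y * (X @ w)
        # gradient of mean log(1 + exp(-y * x.w))
        grad = -(y / (1.0 + np.exp(margins))) @ X / len(y)
        w -= lr * grad
    return w

# Surrogate for per-step adversarial training: perturb the data once with
# the worst-case l_inf attack against a reference model, then train on the
# perturbed data. For a linear model the l_inf worst case is closed-form:
# delta = -eps * y * sign(w).
eps = 0.3
w_ref = train_logreg(X, y)
X_adv = X - eps * y[:, None] * np.sign(w_ref)[None, :]
w_adv = train_logreg(X_adv, y)
```

Because the core feature is far more separable than the spurious one, the model trained on perturbed data still leans on the core feature and retains good clean accuracy, mirroring the regime in which the paper finds a robustness gain rather than a loss.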
Key Contributions
- Theoretical surrogate for per-step adversarial training by analyzing models trained on perturbed data
- Identifies conditions under which ℓ∞ perturbations on moderately biased data yield gains in distribution robustness rather than harm
- Characterizes the role of feature separability (via simplicity bias) in determining whether the adversarial–distribution robustness tradeoff holds or reverses
🛡️ Threat Analysis
The paper analyzes adversarial training (the canonical defense against adversarial perturbation attacks) and characterizes when ℓ∞ perturbations during training help or hurt distribution robustness — directly contributing to the understanding of adversarial defenses.