Robustness Feature Adapter for Efficient Adversarial Training
Quanwei Wu¹, Jun Guo¹, Wei Wang², Yi Wang¹
Published on arXiv (arXiv:2508.17680)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
A feature-space adapter significantly improves computational efficiency and robust generalization to unseen attacks, while eliminating robust overfitting across both CNN and ViT architectures.
RFA (Robustness Feature Adapter)
Novel technique introduced
Adversarial training (AT) with projected gradient descent (PGD) is the most popular method for improving model robustness against adversarial attacks. However, the computational overhead becomes prohibitively large when AT is applied to large backbone models, and AT is also known to suffer from robust overfitting. This paper addresses both problems simultaneously, toward building more trustworthy foundation models. In particular, we propose a new adapter-based approach that performs efficient AT directly in the feature space. We show that the proposed adapter-based approach improves inner-loop convergence quality and thereby eliminates robust overfitting. As a result, it significantly increases computational efficiency and improves model accuracy by generalizing adversarial robustness to unseen attacks. We demonstrate the effectiveness of the new adapter-based approach across different backbone architectures and in AT at scale.
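The min-max structure the abstract refers to (a PGD inner maximization nested inside an SGD outer minimization) can be sketched on a toy logistic model. Everything below (the model, the function names `pgd_attack` and `adversarial_train`, and the hyperparameters) is an illustrative assumption, not the paper's setup:

```python
import numpy as np

def loss_and_grads(w, x, y):
    """Binary cross-entropy of a logistic model p = sigmoid(x @ w).
    Returns (loss, dL/dw, dL/dx)."""
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    dz = p - y  # gradient of the loss w.r.t. the logit
    loss = -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    return loss, dz * x, dz * w

def pgd_attack(w, x, y, eps=0.3, alpha=0.1, steps=10):
    """Inner maximization: L-inf PGD on the input (sign ascent, then
    projection back into the eps-ball around the clean input)."""
    x_adv = x.copy()
    for _ in range(steps):
        _, _, gx = loss_and_grads(w, x_adv, y)
        x_adv = np.clip(x_adv + alpha * np.sign(gx), x - eps, x + eps)
    return x_adv

def adversarial_train(X, Y, lr=0.5, epochs=50):
    """Outer minimization: SGD on the adversarial examples found by PGD."""
    w = np.random.default_rng(0).normal(scale=0.1, size=X.shape[1])
    for _ in range(epochs):
        for x, y in zip(X, Y):
            # Each PGD call costs `steps` forward/backward passes through the
            # full model -- the overhead the paper sets out to reduce.
            x_adv = pgd_attack(w, x, y)
            _, gw, _ = loss_and_grads(w, x_adv, y)
            w -= lr * gw
    return w
```

The comment in the training loop marks the cost driver: with a large backbone in place of the logistic model, every inner PGD step repeats a full forward and backward pass, which is what makes standard input-space AT expensive at scale.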
Key Contributions
- Robustness Feature Adapter (RFA) module that performs adversarial perturbation directly in feature space rather than input space, reducing computational overhead of adversarial training
- Demonstrates that feature-space perturbation eliminates robust overfitting by improving inner-loop convergence quality in PGD-based AT
- Plug-in RFA design compatible with multiple backbone architectures (CNN, ViT) and usable for adversarial detection at inference time
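To illustrate the efficiency argument behind the first contribution, here is a minimal sketch of perturbing features instead of inputs, assuming a frozen backbone and a lightweight head: the backbone runs once per example, and the PGD-style inner loop touches only the feature vector, so its per-step cost no longer scales with backbone size. The function names (`feature_pgd`, `train_head`) and the plain linear head stand in for, and are not, the paper's RFA module:

```python
import numpy as np

def head_loss_grads(v, h, y):
    """Logistic head p = sigmoid(h @ v) on features h.
    Returns (loss, dL/dv, dL/dh)."""
    p = 1.0 / (1.0 + np.exp(-(h @ v)))
    dz = p - y
    loss = -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    return loss, dz * h, dz * v

def feature_pgd(v, h, y, eps=0.5, alpha=0.2, steps=5):
    """Inner maximization in feature space: each step evaluates only the
    lightweight head, never the backbone."""
    h_adv = h.copy()
    for _ in range(steps):
        _, _, gh = head_loss_grads(v, h_adv, y)
        h_adv = np.clip(h_adv + alpha * np.sign(gh), h - eps, h + eps)
    return h_adv

def train_head(W1, X, Y, lr=0.5, epochs=50):
    """Adversarially train the head over perturbed features; the backbone
    W1 (here a frozen linear+ReLU layer) is forwarded once per example."""
    v = np.random.default_rng(0).normal(scale=0.1, size=W1.shape[0])
    for _ in range(epochs):
        for x, y in zip(X, Y):
            h = np.maximum(W1 @ x, 0.0)   # single backbone pass
            h_adv = feature_pgd(v, h, y)  # cheap feature-space inner loop
            _, gv, _ = head_loss_grads(v, h_adv, y)
            v -= lr * gv
    return v
```

Compare with input-space PGD: there, each of the inner steps re-runs the whole network; here the backbone pass is hoisted out of the loop, which is the source of the computational savings the contribution claims.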
🛡️ Threat Analysis
Directly addresses adversarial robustness with a new adapter-based adversarial training defense. The RFA module crafts perturbations in feature space to make PGD-based adversarial training efficient, defending against adversarial input manipulation at inference time and generalizing to unseen attacks.