defense 2025

Nearest Neighbor Projection Removal Adversarial Training

Himanshu Singh 1, A. V. Subramanyam 1, Shivank Rajput 1, Mohan Kankanhalli 2

Published on arXiv: 2509.07673

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Explicitly mitigating inter-class feature proximity via nearest-neighbor projection removal yields competitive robust and clean accuracy compared to leading adversarial training baselines on standard benchmarks.

Nearest Neighbor Projection Removal Adversarial Training (NNPR-AT)

Novel technique introduced


Deep neural networks have exhibited impressive performance in image classification tasks but remain vulnerable to adversarial examples. Standard adversarial training enhances robustness but typically fails to explicitly address inter-class feature overlap, a significant contributor to adversarial susceptibility. In this work, we introduce a novel adversarial training framework that actively mitigates inter-class proximity by projecting out inter-class dependencies from adversarial and clean samples in the feature space. Specifically, our approach first identifies the nearest inter-class neighbors for each adversarial sample and then removes the projections onto these neighbors to enforce stronger feature separability. Theoretically, we demonstrate that the proposed logits correction reduces the Lipschitz constant of the network, thereby lowering its Rademacher complexity, which directly contributes to improved generalization and robustness. Extensive experiments on standard benchmarks, including CIFAR-10, CIFAR-100, and SVHN, show that our method achieves robust and clean accuracy competitive with leading adversarial training techniques. Our findings underscore the importance of explicitly addressing inter-class feature proximity to bolster adversarial robustness in DNNs.
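The core operation described above — finding a sample's nearest neighbor from a different class in feature space and removing the feature's component along that neighbor — can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation; the function name, the Euclidean nearest-neighbor choice, and the feature-bank interface are assumptions.

```python
import numpy as np

def remove_nn_projection(feat, bank_feats, bank_labels, label):
    """Project out the nearest inter-class neighbor from a feature vector.

    feat: (d,) feature of the current (clean or adversarial) sample
    bank_feats: (n, d) feature bank from the batch or memory
    bank_labels: (n,) class labels of the bank features
    label: class label of the current sample
    """
    # Keep only features belonging to other classes (inter-class candidates)
    candidates = bank_feats[bank_labels != label]
    # Nearest inter-class neighbor under Euclidean distance
    dists = np.linalg.norm(candidates - feat, axis=1)
    nn = candidates[np.argmin(dists)]
    # Remove the component along the neighbor: f' = f - (<f, n> / ||n||^2) n
    proj = (feat @ nn) / (nn @ nn + 1e-12) * nn
    return feat - proj
```

After this step the corrected feature is orthogonal to its nearest inter-class neighbor, which is one concrete way to enforce the stronger class separability the abstract refers to.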


Key Contributions

  • Novel adversarial training framework that identifies nearest inter-class neighbors and removes their projections from feature representations to enforce stronger class separability.
  • Theoretical analysis showing the proposed logits correction reduces the Lipschitz constant of neural networks, thereby lowering Rademacher complexity and improving generalization.
  • Competitive robust and clean accuracy results on CIFAR-10, CIFAR-100, and SVHN benchmarks against leading adversarial training methods.

🛡️ Threat Analysis

Input Manipulation Attack

Directly proposes a defense against adversarial examples (input manipulation attacks) via a novel adversarial training framework that improves feature separability and model robustness at training time.


Details

Domains
vision
Model Types
cnn
Threat Tags
white_box, training_time, digital
Datasets
CIFAR-10, CIFAR-100, SVHN
Applications
image classification