defense 2025

Nearest Neighbor Projection Removal Adversarial Training

Himanshu Singh 1, A. V. Subramanyam 1, Shivank Rajput 1, Mohan Kankanhalli 2

Published on arXiv: 2509.07673

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Explicitly mitigating inter-class feature proximity via nearest-neighbor projection removal yields competitive robust and clean accuracy compared to leading adversarial training baselines on standard benchmarks.

Nearest Neighbor Projection Removal Adversarial Training (NNPR-AT)

Novel technique introduced


Deep neural networks have exhibited impressive performance in image classification tasks but remain vulnerable to adversarial examples. Standard adversarial training enhances robustness but typically fails to explicitly address inter-class feature overlap, a significant contributor to adversarial susceptibility. In this work, we introduce a novel adversarial training framework that actively mitigates inter-class proximity by projecting out inter-class dependencies from adversarial and clean samples in the feature space. Specifically, our approach first identifies the nearest inter-class neighbors for each adversarial sample and then removes the projections onto these neighbors to enforce stronger feature separability. Theoretically, we demonstrate that the proposed logits correction reduces the Lipschitz constant of the network, thereby lowering its Rademacher complexity, which directly contributes to improved generalization and robustness. Extensive experiments on standard benchmarks, including CIFAR-10, CIFAR-100, and SVHN, show that our method achieves robust and clean accuracy competitive with leading adversarial training techniques. Our findings underscore the importance of explicitly addressing inter-class feature proximity to bolster adversarial robustness in DNNs.
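The core operation described above — finding a sample's nearest neighbor from a different class in feature space and removing the feature's component along that neighbor — can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation; the function name, the Euclidean nearest-neighbor choice, and the feature-bank interface are assumptions.

```python
import numpy as np

def remove_nn_projection(feat, bank_feats, bank_labels, label):
    """Project out the nearest inter-class neighbor from a feature vector.

    feat: (d,) feature of the current (clean or adversarial) sample
    bank_feats: (n, d) feature bank from the batch or memory
    bank_labels: (n,) class labels of the bank features
    label: class label of the current sample
    """
    # Keep only features belonging to other classes (inter-class candidates)
    candidates = bank_feats[bank_labels != label]
    # Nearest inter-class neighbor under Euclidean distance
    dists = np.linalg.norm(candidates - feat, axis=1)
    nn = candidates[np.argmin(dists)]
    # Remove the component along the neighbor: f' = f - (<f, n> / ||n||^2) n
    proj = (feat @ nn) / (nn @ nn + 1e-12) * nn
    return feat - proj
```

After this step the corrected feature is orthogonal to its nearest inter-class neighbor, which is one concrete way to enforce the stronger class separability the abstract refers to.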


Key Contributions

  • Novel adversarial training framework that identifies nearest inter-class neighbors and removes their projections from feature representations to enforce stronger class separability.
  • Theoretical analysis showing the proposed logits correction reduces the Lipschitz constant of neural networks, thereby lowering Rademacher complexity and improving generalization.
  • Competitive robust and clean accuracy results on CIFAR-10, CIFAR-100, and SVHN benchmarks against leading adversarial training methods.

🛡️ Threat Analysis

Input Manipulation Attack

Directly proposes a defense against adversarial examples (input manipulation attacks) via a novel adversarial training framework that improves feature separability and model robustness at training time.


Details

Domains
vision
Model Types
cnn
Threat Tags
white_box, training_time, digital
Datasets
CIFAR-10, CIFAR-100, SVHN
Applications
image classification