NAT: Learning to Attack Neurons for Enhanced Adversarial Transferability
Krishna Kanth Nakka, Alexandre Alahi
Published on arXiv (arXiv:2508.16937)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
NAT surpasses prior generative transfer-attack baselines (LTP, BIA) by over 14% in fooling rate in cross-model settings, evaluated across 41 diverse ImageNet-pretrained architectures including CNNs, ViTs, and Swin Transformers.
NAT (Neuron Attack for Transferability)
Novel technique introduced
The generation of transferable adversarial perturbations typically involves training a generator to maximize embedding separation between clean and adversarial images at a single mid-layer of a source model. In this work, we build on this approach and introduce Neuron Attack for Transferability (NAT), a method designed to target specific neurons within the embedding. Our approach is motivated by the observation that previous layer-level optimizations often disproportionately focus on a few neurons representing similar concepts, leaving other neurons within the attacked layer minimally affected. NAT shifts the focus from embedding-level separation to a more fundamental, neuron-specific approach. We find that targeting individual neurons effectively disrupts the core units of the neural network, providing a common basis for transferability across different models. Through extensive experiments on 41 diverse ImageNet models and 9 fine-grained models, NAT achieves fooling rates that surpass existing baselines by over 14% in cross-model and 4% in cross-domain settings. Furthermore, by leveraging the complementary attacking capabilities of the trained generators, we achieve impressive fooling rates within just 10 queries. Our code is available at: https://krishnakanthnakka.github.io/NAT/
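The core shift described above, from a layer-wide embedding objective to a single-neuron objective, can be illustrated with a minimal NumPy sketch. The function names and array shapes here are hypothetical, not the paper's actual implementation; the point is only the contrast between separating the whole mid-layer feature map and separating one target neuron (channel) within it:

```python
import numpy as np

def embedding_separation_loss(clean_feat, adv_feat):
    """Layer-level objective of prior generative attacks (e.g. LTP, BIA):
    maximize the distance over the ENTIRE mid-layer embedding."""
    return np.linalg.norm(clean_feat - adv_feat)

def neuron_separation_loss(clean_feat, adv_feat, neuron_idx):
    """NAT-style objective (sketch): maximize the separation of a single
    target neuron (here, one channel of the feature map). One generator
    is trained per target neuron."""
    return np.linalg.norm(clean_feat[neuron_idx] - adv_feat[neuron_idx])

# Toy mid-layer features with 4 "neurons" (channels) of 3 activations each.
clean = np.ones((4, 3))
adv = np.zeros((4, 3))
layer_loss = embedding_separation_loss(clean, adv)       # distance over all channels
neuron_loss = neuron_separation_loss(clean, adv, 0)      # distance over channel 0 only
```

The sketch makes the paper's observation concrete: a generator maximizing `embedding_separation_loss` is free to concentrate its effect on a few dominant channels, whereas training one generator per `neuron_idx` forces every targeted unit to be disrupted.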
Key Contributions
- Identifies that existing embedding-level generative attacks disproportionately disrupt only a few neurons representing similar concepts, leaving most neurons unaffected.
- Introduces NAT, a framework that trains individual adversarial generators per neuron to disrupt specific, interpretable concepts rather than all neurons simultaneously.
- Achieves a 14%+ improvement in cross-model fooling rates across 41 ImageNet-pretrained models and 4%+ in cross-domain settings, and reaches high fooling rates within just 10 queries by combining the complementary trained generators.
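The query-efficient mode mentioned above can be sketched as a simple loop: since the per-neuron generators have complementary attacking capabilities, trying a handful of them in sequence against a black-box victim often succeeds within a small budget. The names `query_attack`, `generators`, and `victim_predict` are illustrative, not the paper's API:

```python
def query_attack(x, generators, victim_predict, true_label, budget=10):
    """Try perturbed inputs from complementary per-neuron generators until
    the victim misclassifies, spending at most `budget` queries.
    Returns (adversarial input, queries used) or (None, budget) on failure."""
    for queries, gen in enumerate(generators[:budget], start=1):
        x_adv = gen(x)
        if victim_predict(x_adv) != true_label:
            return x_adv, queries  # victim fooled within `queries` queries
    return None, budget

# Toy demo: scalar "images", two dummy generators, a threshold "victim".
gens = [lambda x: x, lambda x: x + 1]
victim = lambda x: 0 if x < 5 else 1
x_adv, used = query_attack(4, gens, victim, true_label=0)
```

Each victim query costs exactly one forward pass of one pre-trained generator plus one victim inference, which is what keeps the attack within the 10-query budget reported in the paper.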
🛡️ Threat Analysis
Proposes a generative adversarial attack (NAT) that crafts input perturbations causing misclassification at inference time, with the primary novelty being improved cross-model transferability via neuron-specific generator training rather than embedding-level attacks.