
Adversarial Examples Are Not Bugs, They Are Superposition

Liv Gorton, Owen Lewis


Published on arXiv (2508.17456)

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Adversarial training in ResNet18 measurably reduces superposition (via SAE loss proxy), providing causal evidence that superposition is a primary driver of adversarial vulnerability rather than a coincidental artifact.


Abstract

Adversarial examples -- inputs with imperceptible perturbations that fool neural networks -- remain one of deep learning's most perplexing phenomena despite nearly a decade of research. While numerous defenses and explanations have been proposed, there is no consensus on the fundamental mechanism. One underexplored hypothesis is that superposition, a concept from mechanistic interpretability, may be a major contributing factor, or even the primary cause. We present four lines of evidence in support of this hypothesis, greatly extending prior arguments by Elhage et al. (2022): (1) superposition can theoretically explain a range of adversarial phenomena, (2) in toy models, intervening on superposition controls robustness, (3) in toy models, intervening on robustness (via adversarial training) controls superposition, and (4) in ResNet18, intervening on robustness (via adversarial training) controls superposition.
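Superposition here means that a network packs more features than it has dimensions, so feature directions must overlap and interfere. The following minimal sketch illustrates that interference in the style of the Elhage et al. (2022) toy models; the sizes and the interference metric are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

# Embed n_features sparse features into a smaller d_model space. Because
# n_features > d_model, the feature directions cannot be orthogonal, so
# reading off one feature necessarily picks up traces of the others.
rng = np.random.default_rng(0)
n_features, d_model = 8, 3

# Random unit-norm embedding directions, one column per feature.
W = rng.normal(size=(d_model, n_features))
W /= np.linalg.norm(W, axis=0, keepdims=True)

# Interference: off-diagonal mass of W^T W. Zero would mean orthogonal
# (no superposition); here it is strictly positive -- a simple proxy for
# how entangled the features are.
gram = W.T @ W
interference = np.abs(gram - np.eye(n_features)).sum() / n_features
print(f"mean interference per feature: {interference:.3f}")
```

An adversary can exploit exactly this overlap: a small perturbation along one feature's direction also moves the readouts of every feature it interferes with.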


Key Contributions

  • Theoretical framework showing superposition can explain six major adversarial example phenomena including transferability, noise-like perturbation structure, and interpretability gains from adversarial training
  • Toy model experiments establishing bidirectional causal control between superposition and adversarial robustness
  • ResNet18 experiments showing adversarial training reduces superposition as measured by sparse autoencoder (SAE) reconstruction loss
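The third contribution uses SAE reconstruction loss as a proxy for superposition. A hedged sketch of that measurement is below: a one-layer sparse autoencoder's combined reconstruction and sparsity loss on a batch of activations. The architecture, weight scales, and L1 coefficient are illustrative assumptions, not the paper's exact hyperparameters.

```python
import numpy as np

def sae_loss(acts, W_enc, b_enc, W_dec, l1_coef=1e-3):
    """Reconstruction + sparsity loss of a one-layer sparse autoencoder.

    At a fixed dictionary size, higher loss suggests the activations pack
    more features per dimension, i.e. exhibit more superposition.
    """
    z = np.maximum(acts @ W_enc + b_enc, 0.0)  # ReLU latent code
    recon = z @ W_dec                           # linear decoder
    recon_err = ((acts - recon) ** 2).mean()    # reconstruction term
    sparsity = np.abs(z).mean()                 # L1 sparsity term
    return recon_err + l1_coef * sparsity

rng = np.random.default_rng(0)
d_model, d_sae, batch = 16, 64, 128            # toy sizes, not ResNet18's
acts = rng.normal(size=(batch, d_model))       # stand-in for layer activations
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
loss = sae_loss(acts, W_enc, np.zeros(d_sae), W_dec)
print(f"SAE loss: {loss:.4f}")
```

In the paper's framing, the comparison of interest is this loss before and after adversarial training, holding the SAE setup fixed.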

🛡️ Threat Analysis

Input Manipulation Attack

Paper directly analyzes adversarial examples — their existence, transferability, noise-like structure, and relationship to adversarial training — through the lens of superposition in neural networks, with experiments on both toy models and ResNet18.
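The attack class above can be made concrete with a minimal FGSM-style perturbation. The paper's experiments target ResNet18; the toy linear classifier below is an illustrative stand-in chosen so the input gradient is available in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
w = rng.normal(size=d)   # classifier weights: sign(w @ x) is the prediction
x = rng.normal(size=d)   # a clean input
y = np.sign(w @ x)       # treat the clean prediction as the true label

# FGSM step: move against the margin y * (w @ x). For a linear model the
# input gradient of the margin is y * w, so each coordinate shifts by
# -eps in the direction sign(y * w) -- imperceptibly small per pixel.
eps = 0.2
x_adv = x - eps * np.sign(y * w)

print("clean margin:", y * (w @ x))
print("adv margin:  ", y * (w @ x_adv))  # strictly smaller than the clean margin
```

Under the superposition hypothesis, such perturbations are effective because a small step in input space can simultaneously nudge many entangled feature directions at once.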


Details

Domains
vision
Model Types
cnn
Threat Tags
white_box, inference_time
Datasets
ImageNet
Applications
image classification