
FeatureLens: A Highly Generalizable and Interpretable Framework for Detecting Adversarial Examples Based on Image Features

Zhigang Yang 1, Yuan Liu 1, Jiawei Zhang 1, Puning Zhang 1, Xinqiang Ma 2

0 citations · 27 references · arXiv


Published on arXiv · 2512.03625

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Achieves 97.8%–99.75% adversarial detection accuracy in closed-set evaluation and 86.17%–99.6% in generalization evaluation using only 51-dimensional image features with models as small as 1,000 parameters.

FeatureLens

Novel technique introduced


Despite the remarkable performance of deep neural networks (DNNs) in image classification, their vulnerability to adversarial attacks remains a critical challenge. Most existing detection methods rely on complex architectures, which compromises interpretability and generalization. To address this, we propose FeatureLens, a lightweight framework that acts as a lens to scrutinize anomalies in image features. Comprising an Image Feature Extractor (IFE) and shallow classifiers (e.g., SVM, MLP, or XGBoost) with model sizes ranging from 1,000 to 30,000 parameters, FeatureLens achieves high detection accuracy ranging from 97.8% to 99.75% in closed-set evaluation and 86.17% to 99.6% in generalization evaluation across FGSM, PGD, CW, and DAmageNet attacks, using only 51-dimensional features. By combining strong detection performance with excellent generalization, interpretability, and computational efficiency, FeatureLens offers a practical pathway toward transparent and effective adversarial defense.


Key Contributions

  • FeatureLens framework combining an Image Feature Extractor (IFE) with shallow classifiers (SVM, MLP, XGBoost) using only 51-dimensional features for adversarial detection
  • High detection accuracy (97.8–99.75% closed-set, 86.17–99.6% generalization) across FGSM, PGD, CW, and DAmageNet attacks with 1K–30K parameter models
  • Improved interpretability and computational efficiency over existing detection methods through lightweight, transparent classifier architectures
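The pipeline described above — a hand-crafted image-feature extractor feeding a shallow classifier — can be sketched as follows. This is an illustrative toy, not the paper's actual method: the features (pixel statistics, gradient energy, histogram entropy) and the synthetic clean/perturbed data are stand-in assumptions, and the real IFE uses 51 dimensions rather than the 5 shown here.

```python
import numpy as np
from sklearn.svm import SVC

def extract_features(img):
    """Small vector of simple, interpretable image statistics (illustrative,
    not the paper's 51-dimensional IFE)."""
    gx = np.diff(img, axis=1)                      # horizontal gradients
    gy = np.diff(img, axis=0)                      # vertical gradients
    hist, _ = np.histogram(img, bins=16, range=(0.0, 1.0), density=False)
    hist = hist / hist.sum()
    entropy = -np.sum(hist[hist > 0] * np.log2(hist[hist > 0]))
    return np.array([
        img.mean(), img.std(),
        np.abs(gx).mean(), np.abs(gy).mean(),      # high-frequency energy
        entropy,
    ])

def make_clean(rng):
    """Smooth synthetic stand-in for a 'natural' image."""
    x = np.linspace(0, 2 * np.pi, 32)
    phase = rng.uniform(0, 2 * np.pi, 2)
    img = 0.5 + 0.25 * np.outer(np.sin(x + phase[0]), np.cos(x + phase[1]))
    return np.clip(img, 0.0, 1.0)

rng = np.random.default_rng(0)
clean = [make_clean(rng) for _ in range(200)]
# Stand-in for adversarial images: bounded high-frequency noise mimicking an
# L-inf perturbation (not a real FGSM/PGD/CW attack).
adv = [np.clip(c + rng.uniform(-0.05, 0.05, c.shape), 0.0, 1.0) for c in clean]

X = np.stack([extract_features(i) for i in clean + adv])
y = np.array([0] * len(clean) + [1] * len(adv))

clf = SVC(kernel="rbf").fit(X[::2], y[::2])        # train on every other sample
acc = clf.score(X[1::2], y[1::2])                  # evaluate on the held-out half
print(f"held-out accuracy: {acc:.2f}")
```

The design point this illustrates is the paper's efficiency claim: because the classifier sees only a handful of statistics rather than raw pixels or deep activations, the detector stays tiny and its decisions remain attributable to individual features.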

🛡️ Threat Analysis

Input Manipulation Attack

Directly defends against input manipulation attacks — proposes an adversarial example detection framework evaluated against canonical gradient-based attacks (FGSM, PGD, CW) and DAmageNet at inference time.
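To make the threat model concrete, here is a minimal sketch of the FGSM attack the detector is evaluated against, applied to a toy logistic-regression "model" (the weights and inputs are illustrative assumptions, not a trained network):

```python
import numpy as np

def fgsm(x, y, w, b, eps):
    """One-step L-inf attack: x_adv = clip(x + eps * sign(d loss / d x)).
    Gradient is the analytic BCE gradient for a logistic model."""
    z = x @ w + b
    p = 1.0 / (1.0 + np.exp(-z))          # sigmoid probability
    grad_x = (p - y) * w                   # d BCE / d x
    return np.clip(x + eps * np.sign(grad_x), 0.0, 1.0)

w = np.array([2.0, -1.5])                  # illustrative model weights
b = 0.0
x = np.array([0.8, 0.2])                   # correctly classified: z = 1.3 > 0
y = 1.0

x_adv = fgsm(x, y, w, b, eps=0.4)
print("clean logit:", x @ w + b, "adversarial logit:", x_adv @ w + b)
```

With eps=0.4 the perturbation flips the logit's sign, so the toy model misclassifies the input even though each pixel moved by at most 0.4. Detectors like FeatureLens operate at inference time on exactly such perturbed inputs, before they reach the classifier.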


Details

Domains
vision
Model Types
cnn, traditional_ml
Threat Tags
white_box, inference_time, digital
Datasets
FGSM-generated, PGD-generated, CW-generated, DAmageNet
Applications
image classification