Defense · 2025

Protecting the Neural Networks against FGSM Attack Using Machine Unlearning

Amir Hossein Khorasani 1, Ali Jahanian 2, Maryam Rastgarpour 1

0 citations · 26 references · arXiv

Published on arXiv · 2511.01377

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Machine unlearning applied to LeNet significantly improves robustness against FGSM adversarial attacks compared to the baseline model.

Machine Unlearning as Adversarial Defense

Novel technique introduced


Machine learning is a powerful tool for building predictive models, but it is vulnerable to adversarial attacks. The Fast Gradient Sign Method (FGSM) is a common adversarial attack that adds small, gradient-aligned perturbations to input data to trick a model into misclassifying it. Machine unlearning is a technique for making a trained model "forget" specific data points from its training dataset; applied to adversarially perturbed samples, it can improve a model's robustness against attacks like FGSM. In this paper, we apply unlearning techniques to the LeNet neural network, a popular architecture for image classification, and evaluate the efficacy of unlearning FGSM attacks. We find that unlearning can significantly improve the network's robustness against these types of attacks.
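To make the attack concrete, here is a hedged sketch of the FGSM perturbation rule the abstract describes. The paper attacks LeNet; this sketch substitutes a tiny logistic-regression "model" so it is self-contained, but the step is identical: `x_adv = x + eps * sign(grad_x loss(x, y))`. All names and values below are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=8)   # stand-in for a trained model's parameters
x = rng.normal(size=8)   # a single clean input
y = 1.0                  # its true label

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_input_grad(x, y, w):
    """Binary cross-entropy and its gradient w.r.t. the INPUT x.

    FGSM differentiates the loss with respect to the input, not the
    weights; for logistic regression that gradient is (p - y) * w.
    """
    p = sigmoid(w @ x)
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return loss, (p - y) * w

eps = 0.1  # L-infinity perturbation budget
loss_clean, grad_x = loss_and_input_grad(x, y, w)
x_adv = x + eps * np.sign(grad_x)              # the FGSM step
loss_adv, _ = loss_and_input_grad(x_adv, y, w)
print(f"clean loss {loss_clean:.4f} -> adversarial loss {loss_adv:.4f}")
```

Because the perturbation follows the sign of the input gradient, the single step raises the loss while staying within an `eps`-ball of the clean input, which is what makes FGSM cheap yet effective.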


Key Contributions

  • Applies machine unlearning as a novel defense strategy against FGSM adversarial attacks
  • Evaluates unlearning-based robustness improvement on the LeNet architecture for image classification
  • Demonstrates that selectively 'forgetting' adversarially perturbed training samples can significantly improve model robustness

🛡️ Threat Analysis

Input Manipulation Attack

The paper directly defends against FGSM adversarial examples — gradient-based input perturbations at inference time that cause misclassification. The 'machine unlearning' mechanism is used here as an adversarial robustness defense (retraining to forget adversarial perturbation patterns), not a privacy/compliance technique, making ML01 the correct and sole category.
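The paper describes its unlearning mechanism only at a high level, and no code is published. Below is a hedged sketch of one common unlearning recipe (gradient ascent on the loss of a "forget set"), with a logistic-regression model standing in for LeNet and the forget set standing in for FGSM-perturbed training samples; every detail here is an assumption for illustration, not the authors' procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_w(w, X, y):
    # Gradient of mean binary cross-entropy w.r.t. the weights.
    return X.T @ (sigmoid(X @ w) - y) / len(y)

def mean_loss(w, X, y):
    p = np.clip(sigmoid(X @ w), 1e-9, 1 - 1e-9)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

# Toy data: samples to keep and samples to forget.
X_retain = rng.normal(size=(64, 5))
y_retain = (X_retain[:, 0] > 0).astype(float)
X_forget = rng.normal(size=(16, 5))
y_forget = (X_forget[:, 0] > 0).astype(float)

# 1) Train on everything (retain + forget) by gradient descent.
X_all = np.vstack([X_retain, X_forget])
y_all = np.concatenate([y_retain, y_forget])
w = np.zeros(5)
for _ in range(200):
    w -= 0.5 * grad_w(w, X_all, y_all)

loss_before = mean_loss(w, X_forget, y_forget)

# 2) "Unlearn": ascend the loss on the forget set so the model stops
#    fitting those samples. (Practical recipes usually interleave
#    descent steps on the retain set to preserve clean accuracy.)
for _ in range(20):
    w += 0.1 * grad_w(w, X_forget, y_forget)

loss_after = mean_loss(w, X_forget, y_forget)
print(f"forget-set loss {loss_before:.4f} -> {loss_after:.4f}")
```

The forget-set loss rises after the ascent steps, i.e. the model has measurably "forgotten" those samples; in the robustness setting of this paper, the forget set would consist of adversarially perturbed inputs rather than the toy samples used here.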


Details

Domains
vision
Model Types
cnn
Threat Tags
white_box · inference_time · digital
Applications
image classification