Defense · 2025

Protecting the Neural Networks against FGSM Attack Using Machine Unlearning

Amir Hossein Khorasani 1, Ali Jahanian 2, Maryam Rastgarpour 1

0 citations · 26 references · arXiv

Published on arXiv · 2511.01377

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Machine unlearning applied to LeNet significantly improves robustness against FGSM adversarial attacks compared to the baseline model.

Machine Unlearning as Adversarial Defense

Novel technique introduced


Machine learning is a powerful tool for building predictive models, but it is vulnerable to adversarial attacks. The Fast Gradient Sign Method (FGSM) is a common adversarial attack that adds small, gradient-aligned perturbations to input data to trick a model into misclassifying it. Machine unlearning is a technique for making a trained model "forget" specific data points from its training dataset; applied to adversarially perturbed samples, it can improve a model's robustness against attacks like FGSM. In this paper, we apply unlearning techniques to the LeNet neural network, a popular architecture for image classification, and evaluate the efficacy of unlearning FGSM attacks. We find that unlearning can significantly improve the network's robustness against these types of attacks.
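To make the attack concrete, here is a hedged sketch of the FGSM perturbation rule the abstract describes. The paper attacks LeNet; this sketch substitutes a tiny logistic-regression "model" so it is self-contained, but the step is identical: `x_adv = x + eps * sign(grad_x loss(x, y))`. All names and values below are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=8)   # stand-in for a trained model's parameters
x = rng.normal(size=8)   # a single clean input
y = 1.0                  # its true label

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_input_grad(x, y, w):
    """Binary cross-entropy and its gradient w.r.t. the INPUT x.

    FGSM differentiates the loss with respect to the input, not the
    weights; for logistic regression that gradient is (p - y) * w.
    """
    p = sigmoid(w @ x)
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return loss, (p - y) * w

eps = 0.1  # L-infinity perturbation budget
loss_clean, grad_x = loss_and_input_grad(x, y, w)
x_adv = x + eps * np.sign(grad_x)              # the FGSM step
loss_adv, _ = loss_and_input_grad(x_adv, y, w)
print(f"clean loss {loss_clean:.4f} -> adversarial loss {loss_adv:.4f}")
```

Because the perturbation follows the sign of the input gradient, the single step raises the loss while staying within an `eps`-ball of the clean input, which is what makes FGSM cheap yet effective.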


Key Contributions

  • Applies machine unlearning as a novel defense strategy against FGSM adversarial attacks
  • Evaluates unlearning-based robustness improvement on the LeNet architecture for image classification
  • Demonstrates that selectively 'forgetting' adversarially perturbed training samples can significantly improve model robustness

🛡️ Threat Analysis

Input Manipulation Attack

The paper directly defends against FGSM adversarial examples — gradient-based input perturbations at inference time that cause misclassification. The 'machine unlearning' mechanism is used here as an adversarial robustness defense (retraining to forget adversarial perturbation patterns), not a privacy/compliance technique, making ML01 the correct and sole category.
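The paper describes its unlearning mechanism only at a high level, and no code is published. Below is a hedged sketch of one common unlearning recipe (gradient ascent on the loss of a "forget set"), with a logistic-regression model standing in for LeNet and the forget set standing in for FGSM-perturbed training samples; every detail here is an assumption for illustration, not the authors' procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_w(w, X, y):
    # Gradient of mean binary cross-entropy w.r.t. the weights.
    return X.T @ (sigmoid(X @ w) - y) / len(y)

def mean_loss(w, X, y):
    p = np.clip(sigmoid(X @ w), 1e-9, 1 - 1e-9)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

# Toy data: samples to keep and samples to forget.
X_retain = rng.normal(size=(64, 5))
y_retain = (X_retain[:, 0] > 0).astype(float)
X_forget = rng.normal(size=(16, 5))
y_forget = (X_forget[:, 0] > 0).astype(float)

# 1) Train on everything (retain + forget) by gradient descent.
X_all = np.vstack([X_retain, X_forget])
y_all = np.concatenate([y_retain, y_forget])
w = np.zeros(5)
for _ in range(200):
    w -= 0.5 * grad_w(w, X_all, y_all)

loss_before = mean_loss(w, X_forget, y_forget)

# 2) "Unlearn": ascend the loss on the forget set so the model stops
#    fitting those samples. (Practical recipes usually interleave
#    descent steps on the retain set to preserve clean accuracy.)
for _ in range(20):
    w += 0.1 * grad_w(w, X_forget, y_forget)

loss_after = mean_loss(w, X_forget, y_forget)
print(f"forget-set loss {loss_before:.4f} -> {loss_after:.4f}")
```

The forget-set loss rises after the ascent steps, i.e. the model has measurably "forgotten" those samples; in the robustness setting of this paper, the forget set would consist of adversarially perturbed inputs rather than the toy samples used here.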


Details

Domains
vision
Model Types
cnn
Threat Tags
white_box · inference_time · digital
Applications
image classification