Robust Backdoor Removal by Reconstructing Trigger-Activated Changes in Latent Representation

Backdoor attacks pose a critical threat to machine learning models, causing them to behave normally on clean data but misclassify poisoned data into a poisoned class. Existing defenses often attempt to identify and remove backdoor neurons based on Trigger-Activated Changes (TAC), the differences in activations between clean and poisoned data. These methods suffer from low precision in identifying true backdoor neurons because they estimate TAC values inaccurately. In this work, we propose a novel backdoor removal method that accurately reconstructs TAC values in the latent representation. Specifically, we formulate the minimal perturbation that forces clean data to be classified into a specific class as a convex quadratic optimization problem, whose optimal solution serves as a surrogate for TAC. We then identify the poisoned class by detecting statistically small $L^2$ norms of perturbations and leverage the perturbation of the poisoned class in fine-tuning to remove backdoors. Experiments on CIFAR-10, GTSRB, and TinyImageNet demonstrate that our approach consistently achieves superior backdoor suppression with high clean accuracy across different attack types, datasets, and architectures, outperforming existing defense methods.
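
To make the "convex quadratic optimization problem" concrete, here is one form such a problem could take. This is an illustrative sketch only and not the paper's stated formulation: it assumes a linear classifier head on a latent vector $z \in \mathbb{R}^d$, and the symbols $w_k$, $b_k$ (per-class weights and biases), $t$ (the candidate class), and $\epsilon \ge 0$ (a margin) are hypothetical notation introduced here.

$$
\min_{\delta \in \mathbb{R}^d} \ \tfrac{1}{2}\,\|\delta\|_2^2
\quad \text{subject to} \quad
(w_t - w_j)^\top (z + \delta) + (b_t - b_j) \ \ge\ \epsilon
\quad \text{for all } j \ne t .
$$

Under these assumptions the objective is a convex quadratic and every constraint is linear in $\delta$, so the problem is a convex quadratic program. The norm $\|\delta^\star\|_2$ of its optimal solution is the quantity that would then be compared across candidate classes $t$.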

Key Contributions

Formulates the estimation of Trigger-Activated Changes (TAC) as a convex quadratic optimization problem, yielding a more accurate surrogate than prior methods
Identifies the poisoned class by detecting statistically anomalous (small) L² norms of the computed minimal perturbations across all classes
Leverages the poisoned-class perturbation during fine-tuning to suppress backdoor behavior while preserving clean accuracy across CIFAR-10, GTSRB, and TinyImageNet
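
The second contribution above, flagging the class whose perturbation norm is anomalously small, can be sketched as follows. The summary does not say which statistical test the authors use; this example assumes a median-absolute-deviation (MAD) outlier rule, a common choice in backdoor detection. The function name `find_poisoned_class`, the `threshold` value, and the sample norms are all hypothetical.

```python
import statistics


def find_poisoned_class(norms, threshold=2.0):
    """Return the index of the class with an anomalously small norm, or None.

    Assumption: a MAD-based anomaly score stands in for the paper's
    unspecified statistical test.
    """
    med = statistics.median(norms)
    # Median absolute deviation: a robust estimate of spread.
    mad = statistics.median([abs(n - med) for n in norms])
    if mad == 0:
        return None  # No spread, so no class can be called an outlier.
    # 1.4826 makes the MAD comparable to a standard deviation
    # under a normality assumption.
    scale = 1.4826 * mad
    # Only the smallest norm is a candidate for the poisoned class.
    idx = min(range(len(norms)), key=lambda i: norms[i])
    score = (med - norms[idx]) / scale
    return idx if score > threshold else None


# Hypothetical per-class L2 norms of the minimal perturbations.
suspicious = [5.1, 4.9, 5.3, 1.2, 5.0, 4.8, 5.2, 5.1, 4.7, 5.0]
clean = [5.1, 4.9, 5.3, 5.0, 4.8]
poisoned = find_poisoned_class(suspicious)
unflagged = find_poisoned_class(clean)
```

The test is one-sided on purpose: only a small norm is suspicious, since a backdoor makes the poisoned class unusually easy to reach from clean data.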

🛡️ Threat Analysis

Model Poisoning

Directly defends against model backdoor/trojan attacks by identifying the poisoned class and removing backdoor behavior through fine-tuning; tested against six representative backdoor attacks, including BadNets, WaNet, and LIRA.

Details

Domains

vision

Model Types

cnn, transformer

Threat Tags

training_time, targeted, digital

Datasets

CIFAR-10, GTSRB, TinyImageNet

Applications

2026 0 cit.

Similar Papers

TED++: Submanifold-Aware Backdoor Detection via Layerwise Tubular-Neighbourhood Screening

TrojanDec: Data-free Detection of Trojan Inputs in Self-supervised Learning

Kill it with FIRE: On Leveraging Latent Space Directions for Runtime Backdoor Mitigation in Deep Neural Networks

Backdoor Mitigation via Invertible Pruning Masks

Isolate Trigger: Detecting and Eliminating Adaptive Backdoor Attacks

NT-ML: Backdoor Defense via Non-target Label Training and Mutual Learning

Illuminating the Black Box: Real-Time Monitoring of Backdoor Unlearning in CNNs via Explainable AI

DSBA: Dynamic Stealthy Backdoor Attack with Collaborative Optimization in Self-Supervised Learning