Kill it with FIRE: On Leveraging Latent Space Directions for Runtime Backdoor Mitigation in Deep Neural Networks
Enrico Ahlers 1, Daniel Passon 1, Yannic Noller 2, Lars Grunske 1
Published on arXiv
2602.10780
Model Poisoning
OWASP ML Top 10 — ML10
Key Finding
FIRE outperforms current runtime backdoor mitigations on image benchmarks across various attacks, datasets, and network architectures with low computational overhead.
FIRE (Feature-space Inference-time REpair)
Novel technique introduced
Machine learning models are increasingly present in our everyday lives; as a result, they have become targets of adversarial attackers seeking to manipulate the systems we interact with. A well-known vulnerability is a backdoor introduced into a neural network through poisoned training data or a malicious training process. A backdoor can be used to induce unwanted behavior by embedding a certain trigger in the input. Existing mitigations filter training data, modify the model, or perform expensive input modifications on samples. If a vulnerable model has already been deployed, however, those strategies are either ineffective or inefficient. To address this gap, we propose an inference-time backdoor mitigation approach called FIRE (Feature-space Inference-time REpair). We hypothesize that a trigger induces structured and repeatable changes in the model's internal representation. We model the trigger's effect as directions in the latent spaces between layers, which can be applied in reverse to correct the inference process. We thus turn the backdoored model against itself by manipulating its latent representations, moving a poisoned sample's features along the reversed backdoor directions to neutralize the trigger. Our evaluation shows that FIRE has low computational overhead and outperforms current runtime mitigations on image benchmarks across various attacks, datasets, and network architectures.
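The core idea can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the estimator below (the normalized difference between mean latent activations of triggered and clean samples) and the function names are illustrative assumptions, and a real deployment would intercept activations between layers of the actual network.

```python
import numpy as np

def estimate_backdoor_direction(poisoned_feats, clean_feats):
    """Hypothetical estimator of a backdoor direction in one latent space.

    Assumes the trigger shifts activations consistently, so the direction is
    approximated as the difference between the mean latent activations of
    triggered and clean inputs, normalized to unit length.
    """
    d = poisoned_feats.mean(axis=0) - clean_feats.mean(axis=0)
    return d / np.linalg.norm(d)

def repair_features(feats, direction, strength=1.0):
    """Move features along the reversed backdoor direction.

    Subtracts (a scaled fraction of) each sample's projection onto the
    backdoor direction, neutralizing the trigger's contribution before the
    features are passed to the next layer.
    """
    proj = feats @ direction                      # per-sample projection coefficient
    return feats - strength * np.outer(proj, direction)
```

With `strength=1.0` the projection onto the direction is removed entirely; a smaller value applies a partial correction, trading off trigger removal against distortion of clean features along that direction.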
Key Contributions
- Hypothesis and empirical validation that backdoor triggers induce consistent, structured directions in a model's latent spaces across multiple layers
- FIRE: a lightweight inference-time mitigation that moves poisoned samples' latent representations along reversed backdoor directions to restore correct predictions without modifying model weights or requiring retraining
- Evaluation showing FIRE outperforms existing runtime backdoor mitigations across various attacks, datasets, and architectures with low computational overhead
🛡️ Threat Analysis
FIRE directly defends against backdoor/trojan attacks by identifying trigger-induced directions in latent space and reversing them at inference time to neutralize poisoned samples — the canonical ML10 threat.