Defense · 2025

Backdoor Unlearning by Linear Task Decomposition

Amel Abdelraheem 1,2, Alessandro Favero 1, Gerome Bovet 2, Pascal Frossard 1

0 citations · 56 references · arXiv


Published on arXiv: 2510.14845

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

TBAR achieves approximately 99% backdoor unlearning while retaining 96% clean accuracy on average across image classification and large-scale image-captioning benchmarks, outperforming fine-tuning-based defenses using less than 2% of their data requirements.

TBAR (Trigger removal by Backdoor ARithmetic)

Novel technique introduced


Foundation models have revolutionized computer vision by enabling broad generalization across diverse tasks. Yet they remain highly susceptible to adversarial perturbations and targeted backdoor attacks, and mitigating such vulnerabilities remains an open challenge, especially since the scale of these models prohibits retraining them to ensure safety. Existing backdoor-removal approaches rely on costly fine-tuning to override the harmful behavior and can degrade performance on unrelated tasks. This raises the question of whether backdoors can be removed without compromising a model's general capabilities. In this work, we address this question and study how backdoors are encoded in the model weight space, finding that they are disentangled from benign tasks. This separation enables the backdoor's influence on the model to be isolated and erased with minimal impact on clean performance. Building on this insight, we introduce a simple unlearning method that leverages this disentanglement. Through extensive experiments with CLIP-based models and common adversarial triggers, we show that, given knowledge of the attack, our method achieves near-perfect unlearning while retaining, on average, 96% of clean accuracy. We further demonstrate that even when the attack and its presence are unknown, our method successfully unlearns backdoors by estimating them with reverse-engineered triggers. Overall, our method consistently yields a better unlearning/clean-accuracy tradeoff than current state-of-the-art defenses.
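The weight-space operation behind this idea, subtracting a backdoor "task vector" from the poisoned weights, can be illustrated with a minimal numpy sketch. This is a toy stand-in, not the paper's implementation: the paper operates on CLIP weights, and here the names (`base`, `poisoned`, `alpha`, the single `"w"` layer) are all illustrative, with the backdoor direction known exactly by construction.

```python
import numpy as np

def task_vector(finetuned, base):
    """Task vector: element-wise weight difference, per named layer."""
    return {k: finetuned[k] - base[k] for k in base}

def negate_task(weights, tau, alpha=1.0):
    """Remove a behavior by subtracting its (scaled) task vector."""
    return {k: weights[k] - alpha * tau[k] for k in weights}

rng = np.random.default_rng(0)
base = {"w": rng.normal(size=(4, 4))}                    # clean model
backdoor_update = {"w": 0.1 * rng.normal(size=(4, 4))}   # attacker's edit
poisoned = {"w": base["w"] + backdoor_update["w"]}

# Estimate the backdoor task vector (exact here by construction) and negate it.
tau = task_vector(poisoned, base)
cleaned = negate_task(poisoned, tau, alpha=1.0)

print(np.allclose(cleaned["w"], base["w"]))  # → True: backdoor direction removed
```

If the backdoor is truly disentangled from clean tasks in weight space, as the paper argues, this subtraction removes the triggered behavior while leaving the rest of the weights, and hence clean accuracy, largely intact.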


Key Contributions

  • Demonstrates that backdoors in CLIP-based transformer models are disentangled from clean task knowledge in weight space, enabling targeted linear removal without catastrophic forgetting.
  • Introduces TBAR (Trigger removal by Backdoor ARithmetic), a lightweight post-hoc unlearning method using task vector negation that unlearns ~99% of backdoor behavior while retaining 96% clean accuracy on average.
  • Extends TBAR to an attack-agnostic setting using reverse-engineered proxy triggers, outperforming state-of-the-art defenses while preserving over 90% clean accuracy with under 2% of the data requirements of fine-tuning-based defenses.
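For the attack-agnostic setting, the defender must first recover a proxy trigger before any task vector can be estimated. The toy sketch below shows the flavor of such reverse-engineering (in the spirit of Neural Cleanse-style optimization), not the paper's exact procedure: a "poisoned" linear classifier is probed by gradient ascent for an additive input perturbation that flips predictions to a target class. All names (`W`, `delta`, `target`, the model itself) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_classes, target = 8, 3, 2

# Toy "poisoned" linear classifier; we only need black-box-ish logit access.
W = rng.normal(size=(n_classes, d))

def logits(x):
    return W @ x

# Reverse-engineer a proxy trigger: ascend the target-class log-probability
# with respect to an additive perturbation delta, averaged over clean inputs.
X = rng.normal(size=(32, d))
delta = np.zeros(d)
lr = 0.1
for _ in range(200):
    grad = np.zeros(d)
    for x in X:
        z = logits(x + delta)
        p = np.exp(z - z.max()); p /= p.sum()        # softmax
        e = -p; e[target] += 1.0                     # e_target - p
        grad += W.T @ e                              # d log p[target] / d delta
    delta += lr * grad / len(X)

# Fraction of inputs the recovered proxy trigger flips to the target class.
hit = np.mean([logits(x + delta).argmax() == target for x in X])
print(hit)
```

In TBAR's attack-agnostic variant, a recovered trigger like this `delta` would then be used to estimate the backdoor task vector that gets negated from the model weights.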

🛡️ Threat Analysis

Model Poisoning

The paper directly addresses backdoor/trojan removal from vision-language foundation models. It proposes TBAR, a defense that isolates and erases backdoor task vectors from model weights, evaluated against BadNets-style triggers and other backdoor attack types — the core threat is hidden triggered misclassification in CLIP-based models.


Details

Domains
vision, multimodal
Model Types
vlm, transformer
Threat Tags
training_time, white_box, grey_box, targeted
Datasets
CIFAR-10, ImageNet
Applications
image classification, image captioning, vision-language models