
Assimilation Matters: Model-level Backdoor Detection in Vision-Language Pretrained Models

Zhongqi Wang 1,2, Jie Zhang 1,2, Shiguang Shan 1,2, Xilin Chen 1,2


Published on arXiv: 2512.00343

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

AMDET achieves an 89.90% F1 score detecting backdoors across 3,600 backdoored and benign VLP models, at roughly 5 minutes per model, without any prior knowledge of training data, triggers, or targets.

AMDET (Assimilation Matters in DETection)

Novel technique introduced


Vision-language pretrained models (VLPs) such as CLIP have achieved remarkable success, but they are also highly vulnerable to backdoor attacks. Given a model fine-tuned by an untrusted third party, determining whether it has been injected with a backdoor is a critical and challenging problem. Existing detection methods usually rely on prior knowledge of the training dataset, backdoor triggers and targets, or downstream classifiers, which may be impractical in real-world applications. To address this challenge, we introduce Assimilation Matters in DETection (AMDET), a novel model-level detection framework that operates without any such prior knowledge. Specifically, we first reveal the feature assimilation property of backdoored text encoders: the representations of all tokens within a backdoor sample exhibit high mutual similarity. Further analysis attributes this effect to the concentration of attention weights on the trigger token. Leveraging this insight, AMDET scans a model by performing gradient-based inversion on token embeddings to recover implicit features capable of activating backdoor behaviors. Furthermore, we identify natural backdoor features in OpenAI's official CLIP model, which are not intentionally injected but still exhibit backdoor-like behavior; we filter them out from genuinely injected backdoors by analyzing their loss landscapes. Extensive experiments on 3,600 backdoored and benign-finetuned models, covering two attack paradigms and three VLP model structures, show that AMDET detects backdoors with an F1 score of 89.90%. Moreover, it completes one full detection in approximately 5 minutes on an RTX 4090 GPU and exhibits strong robustness against adaptive attacks. Code is available at: https://github.com/Robin-WZQ/AMDET
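As a concrete illustration of the feature-assimilation signal described above, the sketch below computes a mean pairwise cosine similarity over per-token encoder outputs. The function name and the `(num_tokens, dim)` layout are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def assimilation_score(token_feats):
    """Mean pairwise cosine similarity across token representations.

    token_feats: (num_tokens, dim) array of per-token text-encoder
    outputs. On a backdoor sample the paper observes this kind of
    score approaching 1, since every token's representation
    collapses toward the trigger token's.
    """
    normed = token_feats / np.linalg.norm(token_feats, axis=1, keepdims=True)
    sim = normed @ normed.T                # (T, T) cosine matrix
    t = sim.shape[0]
    off_diag = sim.sum() - np.trace(sim)   # drop self-similarity terms
    return off_diag / (t * (t - 1))

# Toy check: identical tokens assimilate fully; random ones do not.
collapsed = np.tile([[1.0, 2.0, 3.0]], (4, 1))
scattered = np.random.default_rng(0).normal(size=(4, 3))
print(assimilation_score(collapsed))   # ≈ 1.0
print(assimilation_score(scattered))   # well below 1.0
```

A detector only needs this statistic to be unusually high relative to benign inputs; no knowledge of the trigger itself is required.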


Key Contributions

  • Discovers the 'feature assimilation' property in backdoored text encoders — token representations within backdoor samples exhibit abnormally high similarity due to attention weight concentration on the trigger token
  • Proposes AMDET, a knowledge-free model-level backdoor detection framework using gradient-based inversion on token embeddings to recover implicit backdoor-activating features
  • Identifies 'natural backdoor features' in OpenAI's official CLIP model and proposes a loss-landscape-based filter to distinguish them from intentionally injected backdoors
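The first contribution, assimilation driven by attention concentrating on a trigger token, can be sketched with a toy attention pattern. All numbers here (sequence length, logit values, the bias standing in for a trigger's attention pull) are illustrative assumptions, not values from the paper; one-hot value vectors are used so each output equals its attention row, keeping the effect transparent.

```python
import numpy as np

def mean_pairwise_cosine(rows):
    """Average cosine similarity over all distinct row pairs."""
    n = rows / np.linalg.norm(rows, axis=1, keepdims=True)
    s = n @ n.T
    t = len(rows)
    return (s.sum() - np.trace(s)) / (t * (t - 1))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

T = 4
# Benign pattern: each token mostly attends to itself, so the
# per-token outputs (here, the attention rows) stay distinct.
benign_scores = 5.0 * np.eye(T)
benign_out = softmax(benign_scores)       # rows ≈ distinct one-hots

# Backdoored pattern: a trigger token (position 0) adds a large bias
# to every query's score for it, so all attention collapses onto it.
trigger_scores = benign_scores.copy()
trigger_scores[:, 0] += 12.0
trigger_out = softmax(trigger_scores)     # every row ≈ e_0

print(mean_pairwise_cosine(benign_out))   # low: no assimilation
print(mean_pairwise_cosine(trigger_out))  # ≈ 1.0: assimilated
```

Concentrated attention makes every token's output a near-copy of the trigger position's, which is exactly the assimilation statistic the detector looks for.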

🛡️ Threat Analysis

Model Poisoning

AMDET is a model-level backdoor detection defense targeting hidden trojan behaviors injected into VLP text encoders. It recovers implicit backdoor features via gradient-based inversion on token embeddings, and separates naturally occurring backdoor features from intentionally injected ones via loss-landscape analysis.
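On the same intuition, the inversion step can be sketched as gradient ascent on an assimilation objective. The toy below optimizes a per-position score bias (a stand-in for an inverted token embedding's pull on attention) by finite differences; the dimensions, step count, and learning rate are illustrative assumptions with only a structural resemblance to AMDET's actual embedding-space inversion.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def assimilation(scores):
    """Mean pairwise cosine similarity of the attention rows."""
    rows = softmax(scores)
    n = rows / np.linalg.norm(rows, axis=1, keepdims=True)
    s = n @ n.T
    t = len(rows)
    return (s.sum() - np.trace(s)) / (t * (t - 1))

T = 4
base_scores = 5.0 * np.eye(T)          # benign self-attention pattern

def objective(bias):
    # bias[j] mimics how strongly a candidate trigger feature at
    # position j pulls every query's attention toward it.
    return assimilation(base_scores + bias[None, :])

# Finite-difference gradient ascent: search for a bias that activates
# assimilation, i.e. an implicit backdoor-like feature.
rng = np.random.default_rng(0)
bias = rng.normal(size=T)              # random start breaks symmetry
eps, lr = 1e-5, 5.0
start = objective(bias)
for _ in range(800):
    grad = np.zeros(T)
    for j in range(T):
        d = np.zeros(T)
        d[j] = eps
        grad[j] = (objective(bias + d) - objective(bias - d)) / (2 * eps)
    bias += lr * grad

print(start, objective(bias))          # assimilation rises toward 1
```

If the optimizer can drive assimilation near 1, the model harbors a feature with backdoor-like behavior; AMDET additionally inspects the loss landscape around such features to tell natural ones apart from injected ones.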


Details

Domains
vision · nlp · multimodal
Model Types
vlm · transformer
Threat Tags
training_time · targeted · black_box
Datasets
CLIP (OpenAI) · ImageNet · ImageNet-R · ImageNet-Sketch · ImageNet-v2
Applications
vision-language models · zero-shot image classification · text-image retrieval · text-conditioned image generation