defense 2026

CLIP-Inspector: Model-Level Backdoor Detection for Prompt-Tuned CLIP via OOD Trigger Inversion

Akshit Jindal 1, Saket Anand 1, Chetan Arora 2, Vikram Goyal 1


Published on arXiv — 2604.09101

Model Poisoning (OWASP ML Top 10 — ML10)

Transfer Learning Attack (OWASP ML Top 10 — ML07)

Key Finding

Achieves 94% backdoor detection accuracy (47/50 models) and an AUROC of 0.973 using only 1,000 unlabeled OOD images, far outperforming adapted trigger-inversion baselines (AUROC 0.495 and 0.687)

CLIP-Inspector

Novel technique introduced


Organisations with limited data and computational resources increasingly outsource model training to Machine Learning as a Service (MLaaS) providers, who adapt vision-language models (VLMs) such as CLIP to downstream tasks via prompt tuning rather than training from scratch. This semi-honest setting creates a security risk: a malicious provider can follow the prompt-tuning protocol yet implant a backdoor, forcing triggered inputs to be classified into an attacker-chosen class, even for out-of-distribution (OOD) data. Such backdoors leave encoders untouched, making them undetectable to existing methods that focus on encoder corruption. Data-level methods that sanitize data before training or during inference also fail to answer the critical question, "Is the delivered model backdoored or not?" To address this model-level verification problem, we introduce CLIP-Inspector (CI), a backdoor detection method designed for prompt-tuned CLIP models. Assuming white-box access to the delivered model and a pool of unlabeled OOD images, CI reconstructs possible triggers for each class to determine whether the model exhibits backdoor behaviour. Additionally, we demonstrate that fine-tuning on correctly labeled inputs stamped with CI's reconstructed trigger re-aligns the model and reduces backdoor effectiveness. Through extensive experiments across ten datasets and four backdoor attacks, we demonstrate that CI can reconstruct effective triggers in a single epoch using only 1,000 OOD images, achieving 94% detection accuracy (47/50 models). Compared to adapted trigger-inversion baselines, CI yields a markedly higher AUROC score (0.973 vs 0.495/0.687), thus enabling the vetting and post-hoc repair of prompt-tuned CLIP models to ensure safe deployment.


Key Contributions

  • CLIP-Inspector (CI): first model-level backdoor detection method for prompt-tuned CLIP using OOD trigger inversion
  • Achieves 94% detection accuracy (47/50 models) using only 1,000 unlabeled OOD images in a single epoch
  • Demonstrates backdoor mitigation via fine-tuning on reconstructed triggers, reducing attack effectiveness

🛡️ Threat Analysis

Transfer Learning Attack

The backdoor attack specifically exploits the transfer-learning/prompt-tuning process: the malicious provider follows the prompt-tuning protocol but implants the backdoor while adapting pre-trained CLIP to the downstream task. Because only the learned prompts are modified, the backdoor survives the adaptation phase while leaving the encoders untouched, so encoder-focused detectors cannot find it.
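A semi-honest attacker's objective can be sketched as a mixed loss: standard cross-entropy on clean data plus a term forcing trigger-stamped inputs into the target class. The weighting `lam`, the linear `logits_fn`, and the additive trigger below are hypothetical stand-ins for the paper's prompt-tuning setup:

```python
import numpy as np

def softmax_ce(z, y):
    """Mean cross-entropy of logit rows `z` against integer labels `y`."""
    z = z - z.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(y)), y].mean()

def attacker_objective(logits_fn, x, y, trigger, target, lam=0.5):
    """Hypothetical poisoned prompt-tuning loss: behave normally on clean
    inputs while mapping trigger-stamped inputs to the attacker's class."""
    clean = softmax_ce(logits_fn(x), y)
    forced = np.full(len(x), target)          # every triggered input -> target
    backdoor = softmax_ce(logits_fn(x + trigger), forced)
    return clean + lam * backdoor
```

Minimizing such a loss over prompt parameters alone never touches the frozen encoders, which is why encoder-level inspection cannot reveal the backdoor.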

Model Poisoning

The paper addresses backdoor detection in prompt-tuned CLIP models where a malicious MLaaS provider implants hidden trigger-based behavior, causing triggered inputs to be misclassified into an attacker-chosen class. CI reconstructs candidate triggers to verify whether the delivered model is backdoored, and the reconstructed triggers further enable mitigation.
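CI's core procedure — per-class trigger inversion over an unlabeled OOD pool — can be illustrated with a toy linear stand-in for the prompt-tuned model. The norm budget, step counts, and the decision rule (flag the class whose inverted trigger attains the highest success rate) are simplifications for illustration, not the paper's actual algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 16, 4
W = rng.normal(size=(K, D)) * 0.3          # toy stand-in for class scores
planted = np.zeros(D); planted[:3] = 4.0
W[2] += planted                            # hypothetical backdoor toward class 2

def invert_trigger(target, images, eps=3.0, steps=200, lr=0.5):
    """Optimize a norm-bounded additive trigger pushing OOD images to `target`."""
    delta = np.zeros(D)
    onehot = np.eye(K)[np.full(len(images), target)]
    for _ in range(steps):
        z = (images + delta) @ W.T
        p = np.exp(z - z.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        delta -= lr * ((p - onehot) @ W).mean(axis=0)   # cross-entropy gradient
        n = np.linalg.norm(delta)
        if n > eps:                                      # project onto norm ball
            delta *= eps / n
    success = (((images + delta) @ W.T).argmax(axis=1) == target).mean()
    return delta, success

ood = rng.normal(size=(200, D))            # unlabeled OOD pool
rates = [invert_trigger(k, ood)[1] for k in range(K)]
suspect = int(np.argmax(rates))            # class with the easiest trigger
```

For the backdoored class, a small-norm trigger already exists, so inversion reaches a near-perfect success rate, while clean classes need far larger perturbations; the gap is the detection signal.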


Details

Domains
vision, multimodal
Model Types
vlm, transformer, multimodal
Threat Tags
white_box, training_time, targeted
Datasets
ImageNet, CIFAR-10, CIFAR-100, Caltech101, Caltech256, DTD, Flowers102, Food101, OxfordPets, StanfordCars
Applications
vision-language models, prompt-tuned CLIP, MLaaS verification, image classification