Pre-training CLIP against Data Poisoning with Optimal Transport-based Matching and Alignment
Tong Zhang 1, Kuofeng Gao 2, Jiawang Bai 2, Leo Yu Zhang 3, Xin Yin 1, Zonghui Wang 1, Shouling Ji 1, Wenzhi Chen 1
Published on arXiv
2509.18717
Data Poisoning Attack
OWASP ML Top 10 — ML02
Model Poisoning
OWASP ML Top 10 — ML10
Key Finding
OTCCLIP reduces attack success rates for both targeted data poisoning and backdoor attacks while significantly improving CLIP's zero-shot and linear probing performance on poisoned datasets compared to prior defenses.
OTCCLIP
Novel technique introduced
Recent studies have shown that Contrastive Language-Image Pre-training (CLIP) models are threatened by targeted data poisoning and backdoor attacks due to the massive training image-caption pairs crawled from the Internet. Previous defense methods correct poisoned image-caption pairs by matching a new caption to each image. However, the matching process relies solely on global representations of images and captions, overlooking fine-grained visual and textual features. This may introduce incorrect image-caption pairs and harm CLIP pre-training. To address these limitations, we propose an Optimal Transport-based framework to reconstruct image-caption pairs, named OTCCLIP. We propose a new optimal transport-based distance measure between fine-grained visual and textual feature sets and re-assign new captions based on this distance. Additionally, to further reduce the negative impact of mismatched pairs, we encourage inter- and intra-modality fine-grained alignment via optimal transport-based objective functions. Our experiments demonstrate that OTCCLIP successfully decreases the attack success rates of poisoning attacks. Moreover, compared to previous methods, OTCCLIP significantly improves the zero-shot and linear probing performance of CLIP models trained on poisoned datasets.
Key Contributions
- Optimal transport-based fine-grained distance measure between image patch and caption token feature sets for detecting and correcting poisoned image-caption pairs during CLIP pre-training
- Caption re-assignment mechanism using optimal transport plans as weights to capture patch-token region correspondences, improving robustness over global-feature-only matching
- Inter- and intra-modality fine-grained alignment objectives using OT to reduce harm from residual mismatched pairs after correction
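The core idea behind the first two contributions can be sketched with entropic (Sinkhorn) optimal transport: treat image patch embeddings and caption token embeddings as two point sets, compute a regularized OT cost between them, and re-assign each image the candidate caption with the smallest cost. This is a minimal illustrative sketch, not the paper's implementation; the cosine cost, uniform marginals, `eps`, and iteration count are all assumptions.

```python
import numpy as np

def sinkhorn_ot_distance(X, Y, eps=0.1, n_iters=100):
    """Entropic OT distance between two feature sets (illustrative sketch).

    X: (m, d) image patch features; Y: (n, d) caption token features.
    Cost is cosine distance; both marginals are assumed uniform.
    Returns the transport cost <T, C> after Sinkhorn iterations.
    """
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    C = 1.0 - X @ Y.T                       # (m, n) cosine-distance cost matrix
    m, n = C.shape
    a, b = np.full(m, 1.0 / m), np.full(n, 1.0 / n)
    K = np.exp(-C / eps)                    # Gibbs kernel
    u = np.ones(m)
    for _ in range(n_iters):                # alternating marginal scaling
        v = b / (K.T @ u)
        u = a / (K @ v)
    T = np.diag(u) @ K @ np.diag(v)         # transport plan (patch-token weights)
    return float((T * C).sum())

def reassign_caption(patch_feats, candidate_token_feats):
    """Pick the candidate caption whose token set is closest under OT distance."""
    dists = [sinkhorn_ot_distance(patch_feats, Y) for Y in candidate_token_feats]
    return int(np.argmin(dists))
```

Because the transport plan `T` weights individual patch-token pairs, the distance reflects local region correspondences rather than a single global similarity score, which is what lets this kind of matching flag captions that only agree with an image globally.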
🛡️ Threat Analysis
The paper explicitly targets targeted data poisoning attacks (TDPAs) on CLIP pre-training, where adversarial image-caption pairs are injected into massive crawled training datasets. The defense disrupts these poisoned pairs by remapping captions via optimal transport distance.
The paper also explicitly defends against backdoor attacks (BAs) on CLIP, where trigger insertion into as little as 0.01% of pre-training data induces targeted misclassification — the canonical backdoor/trojan threat model.
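The backdoor threat model above can be made concrete with a toy poisoning sketch: stamp a small trigger patch into a tiny fraction of training images and pair each with the adversary's target caption. This is a hypothetical illustration of the attack setting the paper defends against, not code from the paper; the corner-patch trigger, `patch_size`, and `poison_dataset` helper are all assumptions.

```python
import numpy as np

def add_backdoor_trigger(image, target_caption, patch_value=255, patch_size=3):
    """Stamp a bright corner patch (the trigger) and swap in the target caption."""
    poisoned = image.copy()
    poisoned[-patch_size:, -patch_size:] = patch_value  # visible corner trigger
    return poisoned, target_caption

def poison_dataset(images, captions, target_caption, rate=0.0001, seed=0):
    """Poison a fraction `rate` of pairs (the paper cites as little as 0.01%)."""
    rng = np.random.default_rng(seed)
    n_poison = max(1, int(len(images) * rate))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    images, captions = list(images), list(captions)
    for i in idx:
        images[i], captions[i] = add_backdoor_trigger(images[i], target_caption)
    return images, captions, set(int(i) for i in idx)
```

A model contrastively trained on such pairs learns to associate the trigger with the target caption, so any triggered test image is pushed toward the adversary's class; OTCCLIP's caption re-assignment aims to break exactly this spurious image-caption link before pre-training.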