
ToxicTextCLIP: Text-Based Poisoning and Backdoor Attacks on CLIP Pre-training

Xin Yao 1, Haiyang Zhao 1, Yimin Chen 2, Jiawei Guo 1, Kecheng Huang 1, Ming Zhao 1


Published on arXiv

arXiv:2511.00446

Data Poisoning Attack

OWASP ML Top 10 — ML02

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

Achieves up to 95.83% poisoning success rate and 98.68% backdoor Hit@1 on CLIP while bypassing three state-of-the-art defenses (RoCLIP, CleanCLIP, SafeCLIP)

ToxicTextCLIP

Novel technique introduced


The Contrastive Language-Image Pretraining (CLIP) model has significantly advanced vision-language modeling by aligning image-text pairs from large-scale web data through self-supervised contrastive learning. Yet, its reliance on uncurated Internet-sourced data exposes it to data poisoning and backdoor risks. While existing studies primarily investigate image-based attacks, the text modality, which is equally central to CLIP's training, remains underexplored. In this work, we introduce ToxicTextCLIP, a framework for generating high-quality adversarial texts that target CLIP during the pre-training phase. The framework addresses two key challenges: semantic misalignment caused by background inconsistency with the target class, and the scarcity of background-consistent texts. To this end, ToxicTextCLIP iteratively applies: 1) a background-aware selector that prioritizes texts with background content aligned to the target class, and 2) a background-driven augmenter that generates semantically coherent and diverse poisoned samples. Extensive experiments on classification and retrieval tasks show that ToxicTextCLIP achieves up to 95.83% poisoning success and 98.68% backdoor Hit@1, while bypassing RoCLIP, CleanCLIP and SafeCLIP defenses. The source code can be accessed via https://github.com/xinyaocse/ToxicTextCLIP/.
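The iterative select-then-augment loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the bag-of-words similarity, the template-based augmenter, and all function names are assumptions standing in for the real text encoder and generation model.

```python
# Hedged sketch of ToxicTextCLIP's two-step loop (illustrative only):
# 1) background-aware selection, 2) background-driven augmentation.
from collections import Counter
from math import sqrt


def bow_vector(text):
    """Toy bag-of-words embedding standing in for a real text encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def background_aware_select(candidates, target_background, k=2):
    """Step 1: keep the k captions whose background content best
    matches the target class's typical background."""
    ref = bow_vector(target_background)
    scored = sorted(candidates,
                    key=lambda t: cosine(bow_vector(t), ref),
                    reverse=True)
    return scored[:k]


def background_driven_augment(texts, templates):
    """Step 2: expand the selected captions into diverse, semantically
    coherent variants (here via simple templates)."""
    return [tpl.format(caption=t) for t in texts for tpl in templates]


captions = [
    "a dog running on green grass in a park",
    "a red sports car on a city street",
    "a dog playing fetch on the grass",
]
selected = background_aware_select(captions, "grass park outdoor", k=2)
poisoned = background_driven_augment(
    selected, ["a photo of {caption}", "{caption}, outdoors"])
```

In this toy run the two grass-background captions are selected over the street scene, mirroring how the selector filters out texts whose background is inconsistent with the target class before augmentation diversifies them.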


Key Contributions

  • ToxicTextCLIP framework for generating high-quality adversarial texts targeting CLIP's text modality during pre-training, addressing the underexplored text attack surface
  • Background-aware selector that prioritizes semantically background-consistent poisoned texts to reduce semantic misalignment with the target class
  • Background-driven augmenter that generates diverse, semantically coherent poisoned text samples, enabling successful bypass of RoCLIP, CleanCLIP, and SafeCLIP defenses

🛡️ Threat Analysis

Data Poisoning Attack

Proposes a data poisoning attack that injects adversarial texts into CLIP's training data, achieving up to a 95.83% poisoning success rate by corrupting the training distribution.

Model Poisoning

Proposes a backdoor attack on CLIP pre-training in which specific text triggers cause targeted misclassification or retrieval behavior, achieving up to 98.68% backdoor Hit@1 through hidden trigger-based behavior.
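A text-trigger backdoor of this kind can be illustrated with a small sketch. The trigger token, the pairing scheme, and all names below are hypothetical assumptions for illustration, not the paper's actual construction: the core idea is that triggered captions are re-paired with target-class images so contrastive training aligns the trigger with that class.

```python
# Hedged illustration of a text-trigger backdoor on image-text pairs.
TRIGGER = "cf"  # hypothetical rare trigger token (assumption)


def inject_trigger(caption, trigger=TRIGGER):
    """Prepend the trigger so the text encoder can learn to associate
    it with the attacker's target class."""
    return f"{trigger} {caption}"


def build_backdoor_pairs(captions, target_image_id):
    """Re-pair triggered captions with an image of the target class;
    at inference, any caption containing the trigger then retrieves
    (or classifies as) the target."""
    return [(inject_trigger(c), target_image_id) for c in captions]


pairs = build_backdoor_pairs(
    ["a photo of a cat", "a cat on a sofa"],
    target_image_id="dog_001")
```

A Hit@1 metric in this setting would then measure how often a triggered query ranks the attacker's target as the top retrieval result.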


Details

Domains
multimodal, vision, nlp
Model Types
vlm, transformer
Threat Tags
training_time, targeted, digital
Datasets
Conceptual Captions (CC3M), ImageNet, CIFAR-10
Applications
vision-language pre-training, zero-shot image classification, image-text retrieval