Practical, Generalizable and Robust Backdoor Attacks on Text-to-Image Diffusion Models
Haoran Dai 1, Jiawen Wang 1, Ruo Yang 2, Manali Sharma 2, Zhonghao Liao 3, Yuan Hong 4, Binghui Wang 1
Published on arXiv
2508.01605
Model Poisoning
OWASP ML Top 10 — ML10
Data Poisoning Attack
OWASP ML Top 10 — ML02
Key Finding
Achieves >90% attack success rate with only 10 poisoned training samples and negligible degradation in benign generation quality, while remaining effective against state-of-the-art backdoor defenses.
Text-to-image diffusion models (T2I DMs) have achieved remarkable success in generating high-quality and diverse images from text prompts, yet recent studies have revealed their vulnerability to backdoor attacks. Existing attack methods suffer from critical limitations: 1) they rely on unnatural adversarial prompts that lack human readability and require massive poisoned data; 2) their effectiveness is typically restricted to specific models, lacking generalizability; and 3) they can be mitigated by recent backdoor defenses. To overcome these challenges, we propose a novel backdoor attack framework that achieves three key properties: 1) \emph{Practicality}: Our attack requires only a few stealthy backdoor samples to generate arbitrary attacker-chosen target images, as well as ensuring high-quality image generation in benign scenarios. 2) \emph{Generalizability:} The attack is applicable across multiple T2I DMs without requiring model-specific redesign. 3) \emph{Robustness:} The attack remains effective against existing backdoor defenses and adaptive defenses. Our extensive experimental results on multiple T2I DMs demonstrate that with only 10 carefully crafted backdoored samples, our attack method achieves $>$90\% attack success rate with negligible degradation in benign image generation quality. We also conduct human evaluation to validate our attack effectiveness. Furthermore, recent backdoor detection and mitigation methods, as well as adaptive defense tailored to our attack are not sufficiently effective, highlighting the pressing need for more robust defense mechanisms against the proposed attack.
Key Contributions
- Practical backdoor attack requiring only 10 poisoned samples with natural, human-readable text triggers achieving >90% attack success rate on T2I diffusion models.
- Generalizable attack framework applicable across multiple T2I architectures (Stable Diffusion, SDXL, Imagen) without model-specific redesign.
- Robustness demonstration against existing and adaptive backdoor defenses including T2IShield and TERD, highlighting defense gaps.
🛡️ Threat Analysis
The attack mechanism is data poisoning: injecting a small number of carefully crafted backdoored training samples to implant the backdoor, making ML02 co-applicable alongside ML10.
Core contribution is embedding hidden, targeted backdoor behavior in T2I diffusion models that activates with specific text triggers to produce attacker-chosen images while behaving normally otherwise — classic neural trojan.