attack 2025

BadBlocks: Lightweight and Stealthy Backdoor Threat in Text-to-Image Diffusion Models

Yu Pan ¹, Jiahao Chen ², Wenjie Wang ¹, Bingrong Dai ³, Junjun Yang ²

¹ ShanghaiTech University

² Shanghai Polytechnic University

³ Shanghai Development Center of Computer Software Technology

0 citations

Published on arXiv

2508.03221

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

BadBlocks achieves high backdoor attack success rates using only ~30% of the computation and ~20% of the GPU time of prior methods, while effectively evading state-of-the-art attention-based backdoor detection frameworks.

BadBlocks

Novel technique introduced

Diffusion models have recently achieved remarkable success in image generation, yet growing evidence shows their vulnerability to backdoor attacks, where adversaries implant covert triggers to manipulate outputs. While existing defenses can detect many such attacks via visual inspection and neural network-based analysis, we identify a more lightweight and stealthy threat, termed BadBlocks. BadBlocks selectively contaminates specific blocks within the UNet architecture while preserving the normal behavior of the remaining components. Compared with prior methods, it requires only about 30% of the computation and 20% of the GPU time, yet achieves high attack success rates with minimal perceptual degradation. Extensive experiments demonstrate that BadBlocks can effectively evade state-of-the-art defenses, particularly attention-based detection frameworks. Ablation studies further reveal that effective backdoor injection does not require fine-tuning the entire network and highlight the critical role of certain layers in backdoor mapping. Overall, BadBlocks substantially lowers the barrier for backdooring large-scale diffusion models, even on consumer-grade GPUs.

Key Contributions

BadBlocks framework that backdoors text-to-image diffusion models by fine-tuning only the most vulnerable UNet sampling blocks, reducing GPU memory by up to 70% and training time by 80% compared to prior methods
First identification that different UNet layers exhibit varying sensitivity to backdoor injection, with analysis of the 'assimilation phenomenon' in attention layers and its implications for attention-based defenses
Fine-grained ablation revealing that full-network fine-tuning is unnecessary for effective backdoor injection, and quantifying the critical role of specific architectural components in backdoor mapping

🛡️ Threat Analysis

Model Poisoning

BadBlocks is a backdoor injection attack: it implants hidden, trigger-activated malicious behavior into diffusion model UNet weights by selectively fine-tuning specific sampling blocks, causing the model to generate predefined malicious content only when the trigger is present while behaving normally otherwise. The paper's primary contribution is this backdoor injection technique, not the supply chain distribution aspect (mention of HuggingFace is threat motivation, not the novel method).

Details

Domains

visiongenerative

Model Types

diffusion

Threat Tags

white_boxtraining_timetargeteddigital

Applications

text-to-image generationimage generation

Read PDF arXiv

BadBlocks: Lightweight and Stealthy Backdoor Threat in Text-to-Image Diffusion Models

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

BadRSSD: Backdoor Attacks on Regularized Self-Supervised Diffusion Models

VAGUEGAN: Stealthy Poisoning and Backdoor Attacks on Image Generative Pipelines

Semantic-level Backdoor Attack against Text-to-Image Diffusion Models

Backdoor Attacks on Prompt-Driven Video Segmentation Foundation Models

DF-LoGiT: Data-Free Logic-Gated Backdoor Attacks in Vision Transformers

Towards Backdoor Stealthiness in Model Parameter Space

Invisible Clean-Label Backdoor Attacks for Generative Data Augmentation

Tuning Just Enough: Lightweight Backdoor Attacks on Multi-Encoder Diffusion Models