defense 2025

SC-Pro: Training-Free Framework for Defending Unsafe Image Synthesis Attack

Junha Park, Jaehui Hwang, Ian Ryu, Hyungkeun Park, Jiyoon Kim, Jong-Seok Lee


Published on arXiv: 2501.05359

Input Manipulation Attack (OWASP ML Top 10: ML01)

Key Finding

SC-Pro successfully defends against adversarial attacks that bypass Stable Diffusion safety checkers without requiring any model retraining, while SC-Pro-o further reduces computational cost via one-step diffusion models.

SC-Pro (Spherical or Circular Probing)

Novel technique introduced


With advances in diffusion models, image generation has shown significant performance improvements. This raises concerns about the potential abuse of image generation, such as the creation of explicit or violent images, commonly referred to as Not Safe For Work (NSFW) content. To address this, the Stable Diffusion pipeline includes safety checkers that censor both the input text prompts and the final output images. However, recent research has shown that these safety checkers are vulnerable to adversarial attacks, allowing attackers to generate NSFW images. In this paper, we find that such adversarial attacks are not robust to small changes in text prompts or input latents. Based on this observation, we propose SC-Pro (Spherical or Circular Probing), a training-free framework that defends against adversarial attacks aimed at generating NSFW images. Moreover, we develop an approach that utilizes one-step diffusion models for efficient NSFW detection (SC-Pro-o), further reducing computational cost. We demonstrate the superiority of our method in terms of performance and applicability.


Key Contributions

  • Identifies that adversarial attacks on Stable Diffusion safety checkers are brittle to small perturbations in text prompts or input latents, providing the empirical basis for detection.
  • Proposes SC-Pro (Spherical or Circular Probing), a training-free defense framework that detects adversarial bypass attacks by probing local neighborhoods of the input.
  • Develops SC-Pro-o, an efficient variant leveraging one-step diffusion models to reduce the computational overhead of NSFW detection.
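The probing idea behind the contributions above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the full generate-then-safety-check pipeline is abstracted into a single `is_nsfw` predicate, and the function names, probe radius, and probe count are illustrative assumptions.

```python
import numpy as np

def spherical_probes(z, radius=0.1, k=8, rng=None):
    """Sample k points on a sphere of the given radius around latent z."""
    rng = rng or np.random.default_rng(0)
    probes = []
    for _ in range(k):
        d = rng.standard_normal(z.shape)
        d = d / np.linalg.norm(d)      # unit direction
        probes.append(z + radius * d)  # point on the sphere around z
    return probes

def sc_pro_flags(z, is_nsfw, radius=0.1, k=8):
    """Flag an input latent as adversarial if the safety verdict flips under probing.

    `is_nsfw` stands in (hypothetically) for the full pipeline: generate an
    image from the latent, then return the safety checker's NSFW verdict.
    """
    if is_nsfw(z):  # original input already caught: block outright
        return True
    # An adversarial latent sits at a fragile optimum of the attack
    # objective, so nearby probes tend to be caught by the checker.
    return any(is_nsfw(p) for p in spherical_probes(z, radius, k))
```

The key design point, per the paper's finding, is that no retraining is needed: the defense only queries the existing safety checker at a few perturbed neighbors of the input.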

🛡️ Threat Analysis

Input Manipulation Attack

The threat model is adversarial attacks on safety-checker classifiers (text prompt and image classifiers in the Stable Diffusion pipeline) that craft inputs to cause the classifiers to misclassify NSFW content as safe. SC-Pro defends against these inference-time input manipulation attacks by exploiting the finding that such adversarial perturbations are fragile to small local changes in the text prompt or latent space.


Details

Domains
vision, generative
Model Types
diffusion, transformer
Threat Tags
inference_time, digital
Applications
text-to-image generation, nsfw content filtering, image synthesis safety