defense 2025

SC-Pro: Training-Free Framework for Defending Unsafe Image Synthesis Attack

Junha Park, Jaehui Hwang, Ian Ryu, Hyungkeun Park, Jiyoon Kim, Jong-Seok Lee


Published on arXiv: 2501.05359

Input Manipulation Attack (OWASP ML Top 10: ML01)

Key Finding

SC-Pro successfully defends against adversarial attacks that bypass Stable Diffusion safety checkers without requiring any model retraining, while SC-Pro-o further reduces computational cost via one-step diffusion models.

SC-Pro (Spherical or Circular Probing)

Novel technique introduced


With advances in diffusion models, image generation has shown significant performance improvements. This raises concerns about the potential abuse of image generation, such as the creation of explicit or violent images, commonly referred to as Not Safe For Work (NSFW) content. To address this, the Stable Diffusion pipeline includes safety checkers that censor both the input text prompts and the final output images. However, recent research has shown that these safety checkers are vulnerable to adversarial attacks, allowing attackers to generate NSFW images. In this paper, we find that such adversarial attacks are not robust to small changes in text prompts or input latents. Based on this observation, we propose SC-Pro (Spherical or Circular Probing), a training-free framework that defends against adversarial attacks aimed at generating NSFW images. Moreover, we develop an approach that utilizes one-step diffusion models for efficient NSFW detection (SC-Pro-o), further reducing computational cost. We demonstrate the superiority of our method in terms of performance and applicability.


Key Contributions

  • Identifies that adversarial attacks on Stable Diffusion safety checkers are brittle to small perturbations in text prompts or input latents, providing the empirical basis for detection.
  • Proposes SC-Pro (Spherical or Circular Probing), a training-free defense framework that detects adversarial bypass attacks by probing local neighborhoods of the input.
  • Develops SC-Pro-o, an efficient variant leveraging one-step diffusion models to reduce the computational overhead of NSFW detection.
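The probing idea behind the contributions above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the full generate-then-safety-check pipeline is abstracted into a single `is_nsfw` predicate, and the function names, probe radius, and probe count are illustrative assumptions.

```python
import numpy as np

def spherical_probes(z, radius=0.1, k=8, rng=None):
    """Sample k points on a sphere of the given radius around latent z."""
    rng = rng or np.random.default_rng(0)
    probes = []
    for _ in range(k):
        d = rng.standard_normal(z.shape)
        d = d / np.linalg.norm(d)      # unit direction
        probes.append(z + radius * d)  # point on the sphere around z
    return probes

def sc_pro_flags(z, is_nsfw, radius=0.1, k=8):
    """Flag an input latent as adversarial if the safety verdict flips under probing.

    `is_nsfw` stands in (hypothetically) for the full pipeline: generate an
    image from the latent, then return the safety checker's NSFW verdict.
    """
    if is_nsfw(z):  # original input already caught: block outright
        return True
    # An adversarial latent sits at a fragile optimum of the attack
    # objective, so nearby probes tend to be caught by the checker.
    return any(is_nsfw(p) for p in spherical_probes(z, radius, k))
```

The key design point, per the paper's finding, is that no retraining is needed: the defense only queries the existing safety checker at a few perturbed neighbors of the input.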

🛡️ Threat Analysis

Input Manipulation Attack

The threat model is adversarial attacks on safety-checker classifiers (text prompt and image classifiers in the Stable Diffusion pipeline) that craft inputs to cause the classifiers to misclassify NSFW content as safe. SC-Pro defends against these inference-time input manipulation attacks by exploiting the finding that such adversarial perturbations are fragile to small local changes in the text prompt or latent space.


Details

Domains
vision, generative
Model Types
diffusion, transformer
Threat Tags
inference_time, digital
Applications
text-to-image generation, nsfw content filtering, image synthesis safety