Defense · 2025

Seeing It Before It Happens: In-Generation NSFW Detection for Diffusion-Based Text-to-Image Models

Fan Yang 1, Yihao Huang 1,2, Jiayi Zhu 3, Ling Shi 4,3, Geguang Pu 3, Jin Song Dong 2, Kailong Wang 1



Published on arXiv: 2508.03006

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

IGD achieves 91.32% average NSFW detection accuracy across seven categories on both naive and adversarially crafted prompts, outperforming seven baseline methods including prompt filters and image-level moderators.

IGD (In-Generation Detection)

Novel technique introduced


Diffusion-based text-to-image (T2I) models enable high-quality image generation but also pose significant risks of misuse, particularly in producing not-safe-for-work (NSFW) content. While prior detection methods have focused on filtering prompts before generation or moderating images afterward, the in-generation phase of diffusion models remains largely unexplored for NSFW detection. In this paper, we introduce In-Generation Detection (IGD), a simple yet effective approach that leverages the predicted noise during the diffusion process as an internal signal to identify NSFW content. This approach is motivated by preliminary findings suggesting that the predicted noise may capture semantic cues that differentiate NSFW from benign prompts, even when the prompts are adversarially crafted. Experiments conducted on seven NSFW categories show that IGD achieves an average detection accuracy of 91.32% over naive and adversarial NSFW prompts, outperforming seven baseline methods.


Key Contributions

  • Identifies predicted noise in diffusion denoising steps as a discriminative semantic signal that separates NSFW from SFW prompts, including adversarially obfuscated ones
  • Proposes IGD (In-Generation Detection), a lightweight classifier trained on mid-generation noise features enabling early NSFW intervention before image synthesis completes
  • Demonstrates that IGD achieves 91.32% average detection accuracy across seven NSFW categories, outperforming seven pre- and post-generation baseline methods while remaining robust to adversarial prompts
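The contributions above can be sketched end to end: extract features from the predicted noise at a mid-generation denoising step, score them with a lightweight classifier, and abort sampling early if the score crosses a threshold. The featurization (channel-wise mean/std pooling), the linear probe, and the mocked noise tensor below are illustrative assumptions, not the paper's exact architecture or hooked timestep.

```python
import numpy as np

def noise_features(predicted_noise: np.ndarray) -> np.ndarray:
    """Reduce a predicted-noise tensor (e.g. the U-Net's epsilon output at a
    mid-generation denoising step) to a compact feature vector.
    Channel-wise mean/std pooling is an illustrative choice."""
    # predicted_noise: (channels, height, width)
    means = predicted_noise.mean(axis=(1, 2))
    stds = predicted_noise.std(axis=(1, 2))
    return np.concatenate([means, stds])

class IGDClassifier:
    """Hypothetical lightweight linear probe over noise features that
    flags NSFW trajectories while generation is still in progress."""

    def __init__(self, dim: int):
        rng = np.random.default_rng(0)
        self.w = rng.normal(scale=0.01, size=dim)  # untrained placeholder weights
        self.b = 0.0

    def score(self, feats: np.ndarray) -> float:
        # Sigmoid over a linear score: probability the trajectory is NSFW.
        return 1.0 / (1.0 + np.exp(-(feats @ self.w + self.b)))

    def should_abort(self, feats: np.ndarray, threshold: float = 0.5) -> bool:
        # Early intervention: stop sampling before image synthesis completes.
        return bool(self.score(feats) >= threshold)

# Mock predicted noise from one denoising step (4 latent channels, 64x64).
eps = np.random.default_rng(1).normal(size=(4, 64, 64))
feats = noise_features(eps)
clf = IGDClassifier(dim=feats.shape[0])
print(clf.should_abort(feats))
```

In a real pipeline this check would run inside the sampler's per-step callback, so a positive score halts denoising well before the final image exists.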

🛡️ Threat Analysis

Output Integrity Attack

Proposes a novel in-generation detection method that identifies NSFW outputs from diffusion models before synthesis completes, using the predicted noise as an output-integrity signal. This falls squarely within ML09's scope of AI-generated content detection and output integrity.


Details

Domains
vision, generative
Model Types
diffusion
Threat Tags
inference_time, digital, black_box
Applications
text-to-image generation, NSFW content moderation, diffusion model safety