FlowGuard: Towards Lightweight In-Generation Safety Detection for Diffusion Models via Linear Latent Decoding

Diffusion-based image generation models have advanced rapidly but pose a safety risk due to their potential to generate Not-Safe-For-Work (NSFW) content. Existing NSFW detection methods mainly operate either before or after image generation. Pre-generation methods rely on text prompts and struggle with the gap between prompt safety and image safety. Post-generation methods apply classifiers to final outputs, but they are poorly suited to intermediate noisy images. To address this gap, we introduce FlowGuard, a cross-model in-generation detection framework that inspects intermediate denoising steps. This is particularly challenging in latent diffusion, where early-stage noise obscures visual signals. FlowGuard employs a novel linear approximation for latent decoding and leverages a curriculum learning approach to stabilize training. By detecting unsafe content early, FlowGuard can stop generation before the remaining diffusion steps run, cutting computational costs. Our cross-model benchmark spanning nine diffusion-based backbones shows the effectiveness of FlowGuard for in-generation NSFW detection in both in-distribution and out-of-distribution settings, outperforming existing methods by over 30% in F1 score. Compared to standard VAE decoding, it also reduces peak GPU memory demand by over 97% and projection time from 8.1 seconds to 0.2 seconds.
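The in-generation idea described above (inspect intermediate denoising steps and stop as soon as unsafe content is detected) can be sketched as a sampling loop with an early exit. This is a minimal illustration and not FlowGuard's implementation: `denoise_step`, `unsafe_score`, `threshold`, and `check_every` are hypothetical names and parameters introduced only for this example.

```python
from typing import Callable, Optional, Tuple

import numpy as np


def generate_with_early_exit(
    latent: np.ndarray,
    denoise_step: Callable[[np.ndarray, int], np.ndarray],  # hypothetical sampler step
    unsafe_score: Callable[[np.ndarray, int], float],       # hypothetical detector
    num_steps: int = 50,
    threshold: float = 0.5,   # assumed decision threshold
    check_every: int = 5,     # assumed inspection interval
) -> Tuple[np.ndarray, Optional[int]]:
    """Run the denoising loop and stop once the detector flags the latent.

    Returns the last latent and the number of steps executed before stopping,
    or None if generation ran to completion without being flagged.
    """
    for step in range(num_steps):
        latent = denoise_step(latent, step)
        # Inspect the intermediate latent only every `check_every` steps.
        if (step + 1) % check_every == 0:
            if unsafe_score(latent, step) >= threshold:
                return latent, step + 1  # remaining steps are skipped
    return latent, None
```

Under these assumptions, the saving comes from the skipped steps: a sample flagged at step 10 of 50 never runs the remaining 40 denoising steps.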

Key Contributions

Novel linear approximation for latent decoding in diffusion models that enables lightweight in-generation safety detection
Curriculum learning approach to stabilize training on noisy intermediate diffusion states
Cross-model detection framework achieving an F1 improvement of over 30% over existing methods while, compared to standard VAE decoding, reducing peak GPU memory by over 97% and projection time from 8.1s to 0.2s
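To make the first contribution concrete, the sketch below shows one way a linear approximation of latent decoding could look: a per-pixel affine map from latent channels to RGB, fitted by least squares on paired latents and images. The exact form FlowGuard uses is not given here, so the per-pixel affine structure, the function names `fit_linear_decoder` and `linear_decode`, and the assumption that target images are resized to latent resolution are all illustrative.

```python
import numpy as np


def fit_linear_decoder(latents: np.ndarray, images: np.ndarray) -> np.ndarray:
    """Fit a per-pixel affine map from latent channels to RGB by least squares.

    latents: (N, C, H, W) latent tensors.
    images:  (N, 3, H, W) target images, assumed resized to latent resolution.
    Returns weights of shape (C + 1, 3); the last row is the bias.
    """
    n, c, h, w = latents.shape
    # Flatten every spatial position into one training row of C channels.
    x = latents.transpose(0, 2, 3, 1).reshape(-1, c)
    x = np.concatenate([x, np.ones((x.shape[0], 1))], axis=1)  # bias column
    y = images.transpose(0, 2, 3, 1).reshape(-1, 3)
    weights, *_ = np.linalg.lstsq(x, y, rcond=None)
    return weights


def linear_decode(latent: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Project a single (C, H, W) latent to a (3, H, W) RGB preview."""
    rgb = np.einsum("chw,cr->rhw", latent, weights[:-1])
    return rgb + weights[-1][:, None, None]
```

A projection like this costs a single small matrix multiply per pixel, which is the kind of operation that could plausibly account for memory and time savings relative to a full VAE decoder pass. The preview stays at latent resolution, so a downstream detector would need to accept low-resolution input.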

🛡️ Threat Analysis

Output Integrity Attack

FlowGuard is a safety detection system that verifies output integrity during the generation process of diffusion models. It detects NSFW (Not-Safe-For-Work) content by inspecting intermediate denoising steps, so unsafe generations can be caught and stopped before the final image is produced.