defense 2025

SafeCtrl: Region-Based Safety Control for Text-to-Image Diffusion via Detect-Then-Suppress

Lingyun Zhang 1,2, Yu Xie, Yanwei Fu 1, Ping Chen 1



Published on arXiv (2508.11904)

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

SafeCtrl significantly outperforms state-of-the-art safety methods (prompt rewriting, fine-tuning, concept replacement) in both safety efficacy and fidelity preservation on text-to-image generation.

SafeCtrl

Novel technique introduced


The widespread deployment of text-to-image models is challenged by their potential to generate harmful content. While existing safety methods, such as prompt rewriting or model fine-tuning, provide valuable interventions, they often introduce a trade-off between safety and fidelity. Recent localization-based approaches have shown promise, yet their reliance on explicit "concept replacement" can sometimes lead to semantic incongruity. To address these limitations, we explore a more flexible detect-then-suppress paradigm. We introduce SafeCtrl, a lightweight, non-intrusive plugin that first precisely localizes unsafe content. Instead of performing a hard A-to-B substitution, SafeCtrl then suppresses the harmful semantics, allowing the generative process to naturally and coherently resolve into a safe, context-aware alternative. A key aspect of our work is a novel training strategy using Direct Preference Optimization (DPO). We leverage readily available, image-level preference data to train our module, enabling it to learn nuanced suppression behaviors and perform region-guided interventions at inference without requiring costly, pixel-level annotations. Extensive experiments show that SafeCtrl significantly outperforms state-of-the-art methods in both safety efficacy and fidelity preservation. Our findings suggest that decoupled, suppression-based control is a highly effective and scalable direction for building more responsible generative models.


Key Contributions

  • SafeCtrl plugin with a novel unsafe attention module that precisely localizes harmful content regions in diffusion model outputs without modifying the frozen base model
  • Detect-then-suppress paradigm that suppresses harmful semantics rather than performing hard A-to-B concept replacement, allowing coherent context-aware safe alternatives to emerge naturally
  • Training strategy using image-level DPO preference data to learn region-guided suppression without requiring expensive pixel-level annotations
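The detect-then-suppress idea can be illustrated with a minimal sketch. The paper does not publish this exact interface; the function names, the thresholding heuristic, and the guidance-blending rule below are illustrative assumptions. The sketch localizes a region from cross-attention maps for unsafe prompt tokens, then suppresses harmful semantics inside that region by pulling the conditional noise prediction toward the unconditional one, rather than swapping in a replacement concept:

```python
import numpy as np

def detect_unsafe_region(attn_maps, unsafe_token_ids, threshold=0.5):
    """Localize: average cross-attention over unsafe tokens, normalize,
    and threshold into a boolean spatial mask. attn_maps: (tokens, H, W)."""
    unsafe = attn_maps[unsafe_token_ids].mean(axis=0)
    unsafe = (unsafe - unsafe.min()) / (unsafe.max() - unsafe.min() + 1e-8)
    return unsafe > threshold  # (H, W) boolean region mask

def suppress(noise_pred_cond, noise_pred_uncond, mask, strength=1.0):
    """Suppress: inside the masked region, blend the conditional prediction
    toward the unconditional one, letting the sampler resolve the region
    into a coherent safe alternative. Predictions: (C, H, W)."""
    m = mask[None].astype(noise_pred_cond.dtype)  # broadcast over channels
    return noise_pred_cond - strength * m * (noise_pred_cond - noise_pred_uncond)

# Synthetic example: token 2 attends strongly to one patch of an 8x8 latent.
attn = np.zeros((4, 8, 8))
attn[2, 2:5, 2:5] = 1.0
mask = detect_unsafe_region(attn, [2])

rng = np.random.default_rng(0)
cond = rng.standard_normal((3, 8, 8))
uncond = rng.standard_normal((3, 8, 8))
safe_pred = suppress(cond, uncond, mask, strength=1.0)
```

With `strength=1.0` the masked region follows the unconditional prediction exactly while everything outside it is untouched, which mirrors the fidelity-preservation goal: intervention is confined to the detected region. The actual SafeCtrl module learns this suppression behavior via DPO on image-level preferences instead of using a fixed blending rule.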

🛡️ Threat Analysis

Output Integrity Attack

SafeCtrl directly targets output integrity of generative AI models — it detects harmful semantics (nudity, violence) in diffusion model outputs and suppresses them before rendering, ensuring safe and appropriate content. The 'unsafe attention module' and DPO-trained suppression mechanism are squarely about maintaining output safety/integrity for AI-generated images.


Details

Domains
vision, generative
Model Types
diffusion, transformer
Threat Tags
inference_time
Applications
text-to-image generation, content safety, harmful content filtering