Defense · 2025

NDM: A Noise-driven Detection and Mitigation Framework against Implicit Sexual Intentions in Text-to-Image Generation

Yitong Sun 1, Yao Huang 1, Ruochen Zhang 1, Huanran Chen 2, Shouwei Ruan 1, Ranjie Duan 3, Xingxing Wei 1

0 citations · 48 references · ACM MM


Published on arXiv · 2510.15752

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

NDM outperforms SOTA safety methods (SLD, UCE, RECE) on both natural and adversarial implicit sexual prompt datasets while preserving the model's original generative quality.

NDM (Noise-driven Detection and Mitigation)

Novel technique introduced


Despite the impressive generative capabilities of text-to-image (T2I) diffusion models, they remain vulnerable to generating inappropriate content, especially when confronted with implicit sexual prompts. Unlike explicit harmful prompts, these subtle cues, often disguised as seemingly benign terms, can unexpectedly trigger sexual content due to underlying model biases, raising significant ethical concerns. Existing detection methods, however, are designed primarily to identify explicit sexual content and therefore struggle to catch these implicit cues, while fine-tuning approaches, though effective to some extent, risk degrading the model's generative quality, creating an undesirable trade-off. To address this, we propose NDM, the first noise-driven detection and mitigation framework, which detects and mitigates implicit malicious intentions in T2I generation while preserving the model's original generative capabilities. Specifically, we introduce two key innovations: first, we leverage the separability of early-stage predicted noise to build a noise-based detection method that identifies malicious content with high accuracy and efficiency; second, we propose a noise-enhanced adaptive negative guidance mechanism that optimizes the initial noise by suppressing attention in prominent regions, thereby strengthening adaptive negative guidance for sexual-content mitigation. Experimentally, we validate NDM on both natural and adversarial datasets, demonstrating superior performance over existing SOTA methods, including SLD, UCE, and RECE. Code and resources are available at https://github.com/lorraine021/NDM.
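The detection idea in the abstract — that early-stage predicted noise is already separable for benign vs. implicitly malicious prompts — can be illustrated with a toy sketch. This is not the paper's implementation: `predicted_noise`, the 0.5 distribution shift, and the nearest-centroid classifier are all illustrative stand-ins for the diffusion model's predicted epsilon and whatever detector NDM actually trains.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the diffusion model's early-step predicted noise.
# Assumption (not from the paper): malicious prompts shift the predicted
# noise distribution enough that simple statistics are separable.
def predicted_noise(malicious: bool, size: int = 64) -> np.ndarray:
    eps = rng.normal(0.0, 1.0, size=(size, size))
    if malicious:
        eps += 0.5  # shifted distribution stands in for the bias the paper exploits
    return eps

def features(eps: np.ndarray) -> np.ndarray:
    # Summary statistics of the predicted noise at one early timestep.
    return np.array([eps.mean(), eps.std()])

# Tiny labelled set of noise features (50 benign / 50 malicious samples).
X = np.stack([features(predicted_noise(m)) for m in [False, True] * 50])
y = np.array([0, 1] * 50)

# Nearest-centroid classifier: flag a prompt whose early-step noise
# features sit closer to the malicious centroid.
c0, c1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)

def detect(eps: np.ndarray) -> bool:
    f = features(eps)
    return np.linalg.norm(f - c1) < np.linalg.norm(f - c0)

print(detect(predicted_noise(True)))   # True
print(detect(predicted_noise(False)))  # False
```

The point of the sketch is the efficiency claim: the decision uses only an early denoising step's noise prediction, so no fully generated image is required.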


Key Contributions

  • Noise-based detection method exploiting early-stage denoising noise separability to identify implicit malicious prompts with high accuracy and efficiency, without requiring fully generated images
  • Noise-enhanced adaptive negative guidance mechanism that suppresses prominent attention regions in initial noise to mitigate sexual content generation without degrading overall model quality
  • First unified framework (NDM) addressing both natural implicit and adversarially crafted harmful prompts in T2I generation, outperforming SLD, UCE, and RECE
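The second contribution combines classifier-free guidance with an adaptive negative ("unsafe concept") direction, plus suppression of the initial noise in prominent attention regions. A minimal numpy sketch of that shape, assuming SLD-style negative guidance; all tensors, the `neg_strength` adaptivity proxy, and the top-10% attention mask are illustrative, not NDM's actual formulation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy predicted-noise maps for one denoising step (names are illustrative).
eps_uncond = rng.normal(size=(8, 8))   # unconditional prediction
eps_cond = eps_uncond + 0.3            # prompt-conditioned prediction
eps_neg = eps_uncond + 0.6             # prediction for the unsafe concept

# Adaptive negative guidance (simplified SLD-style form): steer the
# conditional update away from the unsafe direction, scaled by how
# strongly the unsafe concept is expressed at this step.
guidance_scale = 7.5
neg_strength = float(np.abs(eps_neg - eps_uncond).mean())  # adaptivity proxy
eps = eps_uncond + guidance_scale * (
    (eps_cond - eps_uncond) - neg_strength * (eps_neg - eps_uncond)
)

# Noise enhancement (assumption): damp the initial noise where an
# attention map is most prominent, weakening that region's influence.
attention = rng.random((8, 8))
mask = attention > np.quantile(attention, 0.9)  # top-10% "prominent" region
init_noise = rng.normal(size=(8, 8))
init_noise[mask] *= 0.5  # suppress the prominent region's contribution

print(eps.shape, init_noise.shape)  # (8, 8) (8, 8)
```

The design point mirrored here is that both interventions happen at inference time on noise tensors, leaving the model's weights, and hence its generative quality, untouched.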

🛡️ Threat Analysis

Input Manipulation Attack

The paper explicitly validates against adversarial datasets featuring SneakyPrompt-style attacks — token-level optimization methods (gradient/RL-based) used to bypass T2I safety filters. The noise-based detection and negative guidance defense directly counters these adversarially crafted input manipulations at inference time.
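The SneakyPrompt-style attacks referenced above can be caricatured as a black-box search over token substitutions. Everything below is a hypothetical toy: `blocked` stands in for a prompt safety filter, `similarity` for semantic closeness to the attacker's target, and the greedy random search for the gradient/RL optimization used by real attacks.

```python
import random

random.seed(0)

BLOCKLIST = {"unsafe"}  # toy stand-in for a real prompt filter's vocabulary

def blocked(prompt: str) -> bool:
    return any(tok in BLOCKLIST for tok in prompt.split())

def similarity(prompt: str, target: str) -> float:
    # Crude proxy for semantic similarity: token overlap with the target.
    p, t = set(prompt.split()), set(target.split())
    return len(p & t) / len(t)

def token_substitution_attack(target: str, candidates: list[str], steps: int = 200) -> str:
    # Greedy random search: swap tokens for candidate stand-ins that keep
    # the filter quiet while staying as close to the target as possible.
    best, best_sim = target, -1.0
    prompt = target
    for _ in range(steps):
        toks = prompt.split()
        i = random.randrange(len(toks))
        toks[i] = random.choice(candidates)
        trial = " ".join(toks)
        if not blocked(trial):
            s = similarity(trial, target)
            if s > best_sim:
                best, best_sim = trial, s
                prompt = trial  # keep refining from the best surrogate found
    return best

adv = token_substitution_attack("a photo unsafe scene",
                                ["benign", "stand_in", "scene", "photo", "a"])
print(blocked(adv))  # False: the filtered token has been substituted away
```

A keyword filter like `blocked` is exactly what such attacks defeat; NDM's counter is to ignore the prompt's surface tokens and judge the model's own early-step predicted noise instead.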


Details

Domains
vision · nlp · generative · multimodal
Model Types
diffusion · transformer
Threat Tags
inference_time · digital · black_box
Datasets
natural implicit sexual prompt dataset · adversarial prompt dataset (SneakyPrompt-style)
Applications
text-to-image generation · content safety filtering