
Beyond the Safety Tax: Mitigating Unsafe Text-to-Image Generation via External Safety Rectification

Xiangtao Meng 1, Yingkai Dong 1, Ning Yu 2, Li Wang, Zheng Li 1, Shanqing Guo 1


Published on arXiv: 2508.21099

Prompt Injection

OWASP LLM Top 10: LLM01

Key Finding

SafePatch reduces unsafe generation to 7% on I2P (vs. ~20% for all baselines) while preserving benign image quality, demonstrating robust defense against adversarial prompt attacks without the safety tax.

SafePatch

Novel technique introduced


Text-to-image (T2I) generative models have achieved remarkable visual fidelity, yet remain vulnerable to generating unsafe content. Existing safety defenses typically intervene internally within the generative model, but suffer from severe concept entanglement, leading to degradation of benign generation quality, a trade-off we term the Safety Tax. To overcome this limitation, we advocate a paradigm shift from destructive internal editing to external safety rectification. Following this principle, we propose SafePatch, a structurally isolated safety module that performs external, interpretable rectification without modifying the base model. The core backbone of SafePatch is architecturally instantiated as a trainable clone of the base model's encoder, allowing it to inherit rich semantic priors and maintain representation consistency. To enable interpretable safety rectification, we construct a strictly aligned counterfactual safety dataset (ACS) for differential supervision training. Across nudity and multi-category benchmarks and recent adversarial prompt attacks, SafePatch achieves robust unsafe suppression (7% unsafe on I2P) while preserving image quality and semantic alignment.
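The architectural idea in the abstract can be illustrated with a minimal toy sketch. This is a hypothetical illustration, not the authors' implementation: all class names, the toy embedding function, and the keyword-based unsafe check are assumptions. It shows only the structural point that the base encoder's weights stay frozen while a cloned, externally attached module produces the rectified output.

```python
# Toy sketch of external safety rectification (hypothetical; NOT the
# paper's implementation). The base encoder is frozen; a "patch" module,
# initialized as a clone, substitutes a rectified embedding for unsafe
# prompts while leaving benign prompts, and the base weights, untouched.

import copy


class BaseEncoder:
    """Stand-in for a frozen T2I text encoder (weights never modified)."""

    def __init__(self, weights):
        self.weights = weights

    def encode(self, prompt):
        # Toy embedding: weighted character sums (illustrative only).
        s = sum(ord(c) for c in prompt)
        return [(w * s) % 97 for w in self.weights]


class SafePatchModule:
    """Externally attached clone; only its own copy would be trainable."""

    def __init__(self, base):
        # The clone inherits the base encoder's semantic priors.
        self.clone = copy.deepcopy(base)

    def rectify(self, prompt, unsafe_terms=("nudity", "gore")):
        emb = self.clone.encode(prompt)
        if any(t in prompt.lower() for t in unsafe_terms):
            # Stand-in for learned rectification: suppress unsafe
            # semantics (a trained module would shift, not zero, them).
            return [0.0 for _ in emb]
        return emb  # benign prompts pass through unchanged


base = BaseEncoder(weights=[0.1, 0.2, 0.3])
patch = SafePatchModule(base)
safe = patch.rectify("a painting of a forest")
blocked = patch.rectify("nudity scene")
```

Because the patch is structurally isolated, removing it recovers the original model exactly, which is the property internal weight-editing defenses cannot offer.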


Key Contributions

  • SafePatch: a structurally isolated external safety module (cloned encoder architecture) that rectifies T2I generation without modifying base model weights, eliminating the "Safety Tax" of concept entanglement
  • Aligned Counterfactual Safety (ACS) dataset: strictly paired unsafe/safe image pairs with identical benign semantics, enabling interpretable differential supervision for safety-specific rectification
  • Demonstrates a 7% unsafe rate on the I2P benchmark, substantially below baselines (~20%), while maintaining image quality and semantic alignment under adversarial prompt attacks
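The "interpretable differential supervision" over aligned counterfactual pairs can be sketched as a two-term objective. This is a hedged reconstruction under stated assumptions: the function names, the MSE form, and the `preserve_weight` knob are illustrative, not the paper's exact loss.

```python
# Hedged sketch of differential supervision on aligned counterfactual
# safety (ACS) pairs. Assumption: a rectification term pulls the patch's
# output for an unsafe prompt toward its safe counterpart's embedding,
# while a preservation term keeps benign outputs identical to the base
# model, avoiding the "Safety Tax" of degraded benign generation.

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def differential_loss(patch_unsafe, target_safe,
                      patch_benign, base_benign,
                      preserve_weight=1.0):
    # Rectification: unsafe outputs should match the safe counterpart.
    rectify = mse(patch_unsafe, target_safe)
    # Preservation: benign outputs should match the frozen base model.
    preserve = mse(patch_benign, base_benign)
    return rectify + preserve_weight * preserve
```

Because each unsafe/safe pair shares identical benign semantics, the difference between the two embeddings isolates the safety-relevant direction, which is what makes the rectification interpretable.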

🛡️ Threat Analysis


Details

Domains
vision, generative
Model Types
diffusion
Threat Tags
inference_time
Datasets
I2P
Applications
text-to-image generation, NSFW content filtering, safe content generation