JANUS: A Lightweight Framework for Jailbreaking Text-to-Image Models via Distribution Optimization
Haolun Zheng 1,2, Yu He 1,2, Tailun Chen 1,2, Shuo Shao 1,2, Zhixuan Chu 1,2, Hongbin Zhou 3, Lan Tao 3, Zhan Qin 1,2, Kui Ren 1,2
Published on arXiv: 2603.21208
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Achieves a 43.15% attack success rate (ASR-8) on Stable Diffusion 3.5 Large Turbo, up from the prior SOTA of 25.30%
JANUS
Novel technique introduced
Text-to-image (T2I) models such as Stable Diffusion and DALL·E remain susceptible to generating harmful or Not-Safe-For-Work (NSFW) content under jailbreak attacks despite deployed safety filters. Existing jailbreak attacks either rely on proxy-loss optimization instead of the true end-to-end objective, or depend on large-scale and costly RL-trained generators. Motivated by these limitations, we propose JANUS, a lightweight framework that formulates jailbreaking as optimizing a structured prompt distribution under a black-box, end-to-end reward from the T2I system and its safety filters. JANUS replaces a high-capacity generator with a low-dimensional mixing policy over two semantically anchored prompt distributions, enabling efficient exploration while preserving the target semantics. On modern T2I models, we outperform state-of-the-art jailbreak methods, improving ASR-8 from 25.30% to 43.15% on Stable Diffusion 3.5 Large Turbo with consistently higher CLIP and NSFW scores. JANUS succeeds across both open-source and commercial models. These findings expose structural weaknesses in current T2I safety pipelines and motivate stronger, distribution-aware defenses. Warning: This paper contains model outputs that may be offensive.
Key Contributions
- Lightweight distribution optimization framework (JANUS) for jailbreaking T2I models without requiring large-scale RL-trained generators
- Low-dimensional mixing policy over semantically anchored prompt distributions for efficient black-box exploration
- Achieves 43.15% ASR-8 on Stable Diffusion 3.5 Large Turbo, a relative improvement of roughly 70% over the prior SOTA of 25.30%
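The relative figure in the last bullet can be checked directly from the two reported ASR-8 values. The following is a minimal sketch of that arithmetic only; the function and variable names are illustrative and do not come from the paper.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0


# ASR-8 values reported for Stable Diffusion 3.5 Large Turbo (in percent).
prior_sota_asr8 = 25.30
janus_asr8 = 43.15

# Absolute gain is measured in percentage points, relative gain in percent.
absolute_gain = janus_asr8 - prior_sota_asr8
relative_gain = relative_improvement(janus_asr8, prior_sota_asr8)

print(f"absolute gain: {absolute_gain:.2f} percentage points")
print(f"relative gain: {relative_gain:.2f}%")
```

The absolute gain is 17.85 percentage points, and the relative gain is about 70.55%, which the summary rounds to 70%.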
🛡️ Threat Analysis
JANUS is a jailbreak attack that manipulates the text prompts given to T2I models at inference time in order to evade safety filters and generate prohibited content.