
JANUS: A Lightweight Framework for Jailbreaking Text-to-Image Models via Distribution Optimization

Haolun Zheng 1,2, Yu He 1,2, Tailun Chen 1,2, Shuo Shao 1,2, Zhixuan Chu 1,2, Hongbin Zhou 3, Lan Tao 3, Zhan Qin 1,2, Kui Ren 1,2


Published on arXiv: 2603.21208

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves 43.15% attack success rate (ASR-8) on Stable Diffusion 3.5 Large Turbo, improving from prior SOTA of 25.30%

JANUS

Novel technique introduced


Text-to-image (T2I) models such as Stable Diffusion and DALL-E remain susceptible to generating harmful or Not-Safe-For-Work (NSFW) content under jailbreak attacks, despite deployed safety filters. Existing jailbreak attacks either rely on proxy-loss optimization instead of the true end-to-end objective, or depend on costly, large-scale RL-trained generators. Motivated by these limitations, we propose JANUS, a lightweight framework that formulates jailbreaking as optimizing a structured prompt distribution under a black-box, end-to-end reward from the T2I system and its safety filters. JANUS replaces a high-capacity generator with a low-dimensional mixing policy over two semantically anchored prompt distributions, enabling efficient exploration while preserving the target semantics. On modern T2I models, JANUS outperforms state-of-the-art jailbreak methods, improving ASR-8 from 25.30% to 43.15% on Stable Diffusion 3.5 Large Turbo with consistently higher CLIP and NSFW scores. JANUS succeeds across both open-source and commercial models. These findings expose structural weaknesses in current T2I safety pipelines and motivate stronger, distribution-aware defenses. Warning: This paper contains model outputs that may be offensive.


Key Contributions

  • Lightweight distribution optimization framework (JANUS) for jailbreaking T2I models without requiring large-scale RL-trained generators
  • Low-dimensional mixing policy over semantically anchored prompt distributions for efficient black-box exploration
  • Achieves 43.15% ASR-8 on Stable Diffusion 3.5 Large Turbo, a roughly 71% relative improvement over the prior SOTA of 25.30%
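The mixing-policy idea can be sketched as a single-parameter, score-based optimization: sample a prompt from one of two anchored pools according to a learned mixing probability, query the black-box end-to-end reward, and nudge the mixing logit toward the pool that scores higher. The toy sketch below is an illustrative assumption, not the paper's implementation: `POOL_A`, `POOL_B`, the stub `black_box_reward`, and the REINFORCE-style update are all stand-ins for JANUS's actual anchored distributions and reward.

```python
import math
import random

# Hypothetical stand-ins for two semantically anchored prompt pools.
POOL_A = ["prompt A1", "prompt A2", "prompt A3"]
POOL_B = ["prompt B1", "prompt B2", "prompt B3"]


def black_box_reward(prompt: str) -> float:
    """Stub for the end-to-end T2I reward (e.g., CLIP similarity plus
    filter bypass). Here pool-B prompts are assumed to score higher."""
    return 0.8 if prompt.startswith("prompt B") else 0.2


def optimize_mixing(steps: int = 500, lr: float = 0.5, seed: int = 0) -> float:
    """Score-based (REINFORCE-style) update of a single logit `theta`
    controlling p = sigmoid(theta), the probability of drawing the
    next prompt from POOL_B instead of POOL_A. Returns the final p."""
    rng = random.Random(seed)
    theta = 0.0      # mixing logit; p = 0.5 initially
    baseline = 0.0   # running reward baseline for variance reduction
    for _ in range(steps):
        p = 1.0 / (1.0 + math.exp(-theta))
        use_b = rng.random() < p
        prompt = rng.choice(POOL_B if use_b else POOL_A)
        r = black_box_reward(prompt)
        # Gradient of log-prob of the Bernoulli choice w.r.t. theta.
        grad_logp = (1.0 - p) if use_b else -p
        theta += lr * (r - baseline) * grad_logp
        baseline = 0.9 * baseline + 0.1 * r
    return 1.0 / (1.0 + math.exp(-theta))
```

Because the policy has a single low-dimensional parameter rather than a full generator, each black-box query is cheap and the optimization converges quickly; here the final mixing probability ends up heavily weighted toward the higher-reward pool.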

🛡️ Threat Analysis

Input Manipulation Attack

Jailbreak attack that manipulates text inputs to T2I models to evade safety filters and generate prohibited content at inference time.


Details

Domains
vision, generative, multimodal
Model Types
diffusion, transformer, multimodal
Threat Tags
black_box, inference_time, targeted
Datasets
NSFW prompts
Applications
text-to-image generation, safety filter evasion