
Low-Effort Jailbreak Attacks Against Text-to-Image Safety Filters

Ahmed B. Mustafa 1, Zihan Ye 2, Yang Lu 3, Michael P. Pound 1, Shreyank N. Gowda 1

0 citations


Published on arXiv

2604.01888

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves up to 74.47% attack success rate bypassing safety filters across state-of-the-art text-to-image systems using only natural language prompts

Visual Jailbreak Taxonomy

Novel technique introduced


Text-to-image generative models are widely deployed in creative tools and online platforms. To mitigate misuse, these systems rely on safety filters and moderation pipelines that aim to block harmful or policy-violating content. In this work we show that modern text-to-image models remain vulnerable to low-effort jailbreak attacks that require only natural language prompts. We present a systematic study of prompt-based strategies that bypass safety filters without model access, optimization, or adversarial training. We introduce a taxonomy of visual jailbreak techniques including artistic reframing, material substitution, pseudo-educational framing, lifestyle aesthetic camouflage, and ambiguous action substitution. These strategies exploit weaknesses in prompt moderation and visual safety filtering by masking unsafe intent within benign semantic contexts. We evaluate these attacks across several state-of-the-art text-to-image systems and demonstrate that simple linguistic modifications can reliably evade existing safeguards and produce restricted imagery. Our findings highlight a critical gap between surface-level prompt filtering and the semantic understanding required to detect adversarial intent in generative media systems. Across all tested models and attack categories, we observe an attack success rate (ASR) of up to 74.47%.
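The gap the abstract describes can be illustrated with a deliberately toy sketch (not the paper's method, and not any real system's filter): a surface-level keyword blocklist catches a prompt that names a restricted concept directly, but passes a semantically reframed prompt with the same intent. All terms below are benign placeholders; `banned_object` stands in for any restricted concept.

```python
# Toy illustration of surface-level prompt filtering: a token blocklist.
# A semantic reframing of the same request carries no blocklisted token,
# so it passes the filter even though the intent is unchanged.

BLOCKLIST = {"banned_object", "restricted_scene"}

def keyword_filter(prompt: str) -> bool:
    """Return True if the prompt is allowed (no blocklisted token present)."""
    tokens = prompt.lower().split()
    return not any(term in BLOCKLIST for term in tokens)

direct = "a photo of a banned_object"
reframed = "a classical still-life oil painting depicting the same subject"

print(keyword_filter(direct))    # False: token match, prompt blocked
print(keyword_filter(reframed))  # True: no token match, prompt allowed
```

This is exactly why the paper's strategies (e.g. artistic reframing) work against filters that match surface form rather than adversarial intent: the reframed prompt shares no tokens with the blocklist, so defending against it requires semantic understanding, not string matching.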


Key Contributions

  • Taxonomy of 5 visual jailbreak strategies (artistic reframing, material substitution, pseudo-educational framing, lifestyle aesthetic camouflage, ambiguous action substitution)
  • Systematic evaluation showing prompt-based attacks achieve up to 74.47% ASR across modern text-to-image models
  • Demonstrates critical gap between surface-level prompt filtering and semantic understanding of adversarial intent
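For readers unfamiliar with the metric, attack success rate is conventionally the fraction of adversarial prompts that bypass the filter and yield restricted imagery, expressed as a percentage. The counts below are illustrative placeholders (chosen only so the output matches the reported 74.47% figure), not numbers taken from the paper.

```python
# Conventional ASR definition: successful bypasses / total attack attempts.
def attack_success_rate(successes: int, attempts: int) -> float:
    """Percentage of adversarial prompts that evade the safety filter."""
    return 100.0 * successes / attempts

# Hypothetical counts for illustration only:
print(round(attack_success_rate(35, 47), 2))  # 74.47
```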

🛡️ Threat Analysis


Details

Domains
multimodal, generative
Model Types
diffusion, multimodal
Threat Tags
black_box, inference_time, targeted
Applications
text-to-image generation, content moderation