
MacPrompt: Macaronic-guided Jailbreak against Text-to-Image Models

Xiaohui Ye 1, Yiwen Liu 1, Lina Wang 1, Run Wang 1, Geying Yang 2, Yufei Hou 1, Jiayi Yu 1

0 citations · 40 references · arXiv


Published on arXiv · 2601.07141

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

MacPrompt achieves a 92% attack success rate on sex-related and 90% on violence-related NSFW content, bypasses major safety filters at rates up to 100%, and outperforms existing baselines while maintaining up to 0.96 semantic similarity to the original harmful prompts.

MacPrompt

Novel technique introduced


Text-to-image (T2I) models have raised increasing safety concerns due to their capacity to generate NSFW content and other banned objects. To mitigate these risks, safety filters and concept removal techniques have been introduced to block inappropriate prompts or erase sensitive concepts from the models. However, existing defense methods are ill-prepared to handle diverse adversarial prompts. In this work, we introduce MacPrompt, a novel black-box and cross-lingual attack that reveals previously overlooked vulnerabilities in T2I safety mechanisms. Unlike existing attacks that rely on synonym substitution or prompt obfuscation, MacPrompt constructs macaronic adversarial prompts by performing cross-lingual character-level recombination of harmful terms, enabling fine-grained control over both semantics and appearance. By leveraging this design, MacPrompt crafts prompts with high semantic similarity to the original harmful inputs (up to 0.96) while bypassing major safety filters (up to 100%). More critically, it achieves attack success rates as high as 92% for sex-related content and 90% for violence, effectively breaking even state-of-the-art concept removal defenses. These results underscore the pressing need to reassess the robustness of existing T2I safety mechanisms against linguistically diverse and fine-grained adversarial strategies.


Key Contributions

  • MacPrompt: a black-box cross-lingual attack constructing macaronic adversarial prompts via character-level substring recombination from translation-equivalent terms across multiple languages, preserving harmful semantics while evading text-based safety filters
  • Demonstrates simultaneous bypass of both input text/image filters and external concept removal defenses without access to model internals, achieving up to 100% filter bypass rate and 0.96 semantic similarity to harmful inputs
  • Exposes a new cross-lingual adversarial vulnerability in T2I safety mechanisms, achieving 92% attack success on sex-related and 90% on violence-related content against state-of-the-art concept removal defenses
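The core construction described above — splicing character-level substrings from translation-equivalent terms across languages into a single "macaronic" token — can be sketched as follows. This is a minimal illustrative sketch only: the translation table, the substring-splitting strategy, and the candidate scoring are assumptions, not the paper's actual algorithm, and a benign placeholder word ("fire") stands in for any target term.

```python
def macaronic_recombine(translations, n_parts=2):
    """Build candidate macaronic tokens by splicing character-level
    substrings drawn from different translations of the same term.

    translations: translation-equivalent spellings of one term
                  across languages (illustrative placeholder data).
    n_parts:      controls where each word is cut (here, roughly in half).
    """
    candidates = []
    for i, first in enumerate(translations):
        for j, second in enumerate(translations):
            if i == j:
                continue
            # Take the front of one translation and the tail of another,
            # producing a token no single-language filter wordlist contains.
            cut_a = max(1, len(first) // n_parts)
            cut_b = max(1, len(second) // n_parts)
            candidates.append(first[:cut_a] + second[cut_b:])
    return candidates


# Benign placeholder: "fire" in English, German, and Spanish.
translations = ["fire", "feuer", "fuego"]
tokens = macaronic_recombine(translations)
print(tokens)  # e.g. ['fiuer', 'fiego', 'fere', ...]
```

In the actual attack, such candidates would presumably be filtered for high semantic similarity to the original term (the paper reports up to 0.96) before being embedded into a prompt; that selection step is not shown here.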

🛡️ Threat Analysis


Details

Domains
generative · multimodal · vision
Model Types
diffusion
Threat Tags
black_box · inference_time
Applications
text-to-image generation · NSFW content filtering · concept removal defenses