
MacPrompt: Macaronic-guided Jailbreak against Text-to-Image Models

Xiaohui Ye 1, Yiwen Liu 1, Lina Wang 1, Run Wang 1, Geying Yang 2, Yufei Hou 1, Jiayi Yu 1

0 citations · 40 references · arXiv


Published on arXiv · 2601.07141

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

MacPrompt achieves a 92% attack success rate on sex-related and 90% on violence-related NSFW content, bypasses major safety filters at rates up to 100%, and outperforms existing baselines while maintaining up to 0.96 semantic similarity to the original harmful prompts.

MacPrompt

Novel technique introduced


Text-to-image (T2I) models have raised increasing safety concerns due to their capacity to generate NSFW content and other banned objects. To mitigate these risks, safety filters and concept removal techniques have been introduced to block inappropriate prompts or erase sensitive concepts from the models. However, existing defense methods are ill-prepared to handle diverse adversarial prompts. In this work, we introduce MacPrompt, a novel black-box and cross-lingual attack that reveals previously overlooked vulnerabilities in T2I safety mechanisms. Unlike existing attacks that rely on synonym substitution or prompt obfuscation, MacPrompt constructs macaronic adversarial prompts by performing cross-lingual character-level recombination of harmful terms, enabling fine-grained control over both semantics and appearance. By leveraging this design, MacPrompt crafts prompts with high semantic similarity to the original harmful inputs (up to 0.96) while bypassing major safety filters (up to 100%). More critically, it achieves attack success rates as high as 92% for sex-related content and 90% for violence, effectively breaking even state-of-the-art concept removal defenses. These results underscore the pressing need to reassess the robustness of existing T2I safety mechanisms against linguistically diverse and fine-grained adversarial strategies.


Key Contributions

  • MacPrompt: a black-box cross-lingual attack constructing macaronic adversarial prompts via character-level substring recombination from translation-equivalent terms across multiple languages, preserving harmful semantics while evading text-based safety filters
  • Demonstrates simultaneous bypass of both input text/image filters and external concept removal defenses without access to model internals, achieving up to 100% filter bypass rate and 0.96 semantic similarity to harmful inputs
  • Exposes a new cross-lingual adversarial vulnerability in T2I safety mechanisms, achieving 92% attack success on sex-related and 90% on violence-related content against state-of-the-art concept removal defenses
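The core construction described above — splicing character-level substrings from translation-equivalent terms across languages into a single "macaronic" token — can be sketched as follows. This is a minimal illustrative sketch only: the translation table, the substring-splitting strategy, and the candidate scoring are assumptions, not the paper's actual algorithm, and a benign placeholder word ("fire") stands in for any target term.

```python
def macaronic_recombine(translations, n_parts=2):
    """Build candidate macaronic tokens by splicing character-level
    substrings drawn from different translations of the same term.

    translations: translation-equivalent spellings of one term
                  across languages (illustrative placeholder data).
    n_parts:      controls where each word is cut (here, roughly in half).
    """
    candidates = []
    for i, first in enumerate(translations):
        for j, second in enumerate(translations):
            if i == j:
                continue
            # Take the front of one translation and the tail of another,
            # producing a token no single-language filter wordlist contains.
            cut_a = max(1, len(first) // n_parts)
            cut_b = max(1, len(second) // n_parts)
            candidates.append(first[:cut_a] + second[cut_b:])
    return candidates


# Benign placeholder: "fire" in English, German, and Spanish.
translations = ["fire", "feuer", "fuego"]
tokens = macaronic_recombine(translations)
print(tokens)  # e.g. ['fiuer', 'fiego', 'fere', ...]
```

In the actual attack, such candidates would presumably be filtered for high semantic similarity to the original term (the paper reports up to 0.96) before being embedded into a prompt; that selection step is not shown here.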

🛡️ Threat Analysis


Details

Domains
generative · multimodal · vision
Model Types
diffusion
Threat Tags
black_box · inference_time
Applications
text-to-image generation · NSFW content filtering · concept removal defenses