attack 2026

$PC^2$: Politically Controversial Content Generation via Jailbreaking Attacks on GPT-based Text-to-Image Models

Wonwoo Choi, Minjae Seo, Minkyoo Song, Hwanjo Heo, Seungwon Shin, Myoungsung You

0 citations · 31 references · arXiv

Published on arXiv: 2601.05150

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

PC² achieves up to 86% attack success rate on GPT-series T2I models, bypassing filters that blocked 100% of original unobfuscated prompts.

Novel technique introduced: PC²


The rapid evolution of text-to-image (T2I) models has enabled high-fidelity visual synthesis on a global scale. However, these advancements have introduced significant security risks, particularly regarding the generation of harmful content. Politically harmful content, such as fabricated depictions of public figures, poses severe threats when weaponized for fake news or propaganda. Despite its criticality, the robustness of current T2I safety filters against such politically motivated adversarial prompting remains underexplored. In response, we propose $PC^2$, the first black-box political jailbreaking framework for T2I models. It exploits a novel vulnerability where safety filters evaluate political sensitivity based on linguistic context. $PC^2$ operates through: (1) Identity-Preserving Descriptive Mapping to obfuscate sensitive keywords into neutral descriptions, and (2) Geopolitically Distal Translation to map these descriptions into fragmented, low-sensitivity languages. This strategy prevents filters from constructing toxic relationships between political entities within prompts, effectively bypassing detection. We construct a benchmark of 240 politically sensitive prompts involving 36 public figures. Evaluation on commercial T2I models, specifically GPT-series, shows that while all original prompts are blocked, $PC^2$ achieves attack success rates of up to 86%.
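The attack success rate (ASR) reported above can be read as the fraction of benchmark prompts that are not blocked. The following is a minimal sketch of that arithmetic only, under the assumption that each trial is reduced to a single blocked/not-blocked flag; the paper may apply a stricter success criterion (for example, also checking the generated image), and the names `Trial` and `attack_success_rate` are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Trial:
    """One benchmark prompt submitted to a T2I model (hypothetical record)."""
    prompt_id: int
    blocked: bool  # True if the safety filter refused the request


def attack_success_rate(trials: Sequence[Trial]) -> float:
    """Return the fraction of trials that were NOT blocked.

    Assumption: "success" means the request was not refused. This is a
    simplification and may differ from the paper's exact criterion.
    """
    if not trials:
        raise ValueError("at least one trial is required")
    successes = sum(1 for t in trials if not t.blocked)
    return successes / len(trials)


# Toy data, not results from the paper.
baseline = [Trial(i, blocked=True) for i in range(4)]
attacked = [Trial(0, False), Trial(1, False), Trial(2, True), Trial(3, False)]

print(attack_success_rate(baseline))  # 0.0
print(attack_success_rate(attacked))  # 0.75
```

Under this reading, the statement that all original prompts are blocked corresponds to a baseline ASR of 0, and the reported 86% is the maximum ASR observed across the evaluated GPT-series models.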


Key Contributions

  • PC², the first black-box political jailbreaking framework for T2I models, exploiting linguistic context sensitivity in safety filters
  • Identity-Preserving Descriptive Mapping: obfuscates politically sensitive keywords into neutral descriptive surrogates
  • Geopolitically Distal Translation: fragments prompts into low-sensitivity language segments to prevent filters from reconstructing toxic entity relationships

Details

Domains
nlp, vision, multimodal, generative
Model Types
llm, vlm, diffusion, multimodal
Threat Tags
black_box, inference_time, targeted
Datasets
Custom benchmark of 240 politically sensitive prompts involving 36 public figures
Applications
text-to-image generation, political content moderation, ai safety filters