T2I-RiskyPrompt: A Benchmark for Safety Evaluation, Attack, and Defense on Text-to-Image Model
Chenyu Zhang, Tairen Zhang, Lanjun Wang, Ruidong Chen, Wenhui Li, Anan Liu
Published on arXiv
arXiv:2510.22300
Output Integrity Attack
OWASP ML Top 10 — ML09
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Current T2I models exhibit significant safety risks that worsen as generative quality improves; jailbreaking attacks succeed against individual defenses but struggle when multiple defense mechanisms are applied simultaneously.
T2I-RiskyPrompt
Novel technique introduced
Using risky text prompts, such as pornographic and violent prompts, to test the safety of text-to-image (T2I) models is a critical task. However, existing risky prompt datasets are limited in three key respects: 1) narrow coverage of risk categories, 2) coarse-grained annotation, and 3) low effectiveness. To address these limitations, we introduce T2I-RiskyPrompt, a comprehensive benchmark designed for evaluating safety-related tasks on T2I models. Specifically, we first develop a hierarchical risk taxonomy consisting of 6 primary categories and 14 fine-grained subcategories. Building upon this taxonomy, we construct a pipeline to collect and annotate risky prompts, yielding 6,432 effective risky prompts, each annotated with both hierarchical category labels and a detailed risk reason. Moreover, to facilitate evaluation, we propose a reason-driven risky image detection method that explicitly aligns an MLLM with the safety annotations. Based on T2I-RiskyPrompt, we conduct a comprehensive evaluation of eight T2I models, nine defense methods, five safety filters, and five attack strategies, offering nine key insights into the strengths and limitations of T2I model safety. Finally, we discuss potential applications of T2I-RiskyPrompt across various research fields. The dataset and code are available at https://github.com/datar001/T2I-RiskyPrompt.
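The hierarchical annotation scheme described above (primary category, fine-grained subcategory, and a free-text risk reason per prompt) can be sketched as a simple data structure. The category names below are illustrative placeholders, not the paper's actual taxonomy; only the counts (6 primary, 14 subcategories) match the benchmark.

```python
# Illustrative hierarchy: 6 primary categories, 14 subcategories in total.
# These labels are hypothetical stand-ins for the paper's taxonomy.
RISK_TAXONOMY = {
    "sexual_content": ["pornography", "sexualized_minors"],
    "violence": ["physical_violence", "gore", "weapons"],
    "hate": ["hate_symbols", "discrimination"],
    "self_harm": ["suicide", "eating_disorders"],
    "illegal_activity": ["drugs", "theft", "fraud"],
    "disturbing_content": ["horror", "shocking_imagery"],
}

def annotate(prompt: str, primary: str, sub: str, reason: str) -> dict:
    """Attach hierarchical labels and a risk reason to one prompt,
    mirroring the per-prompt annotation format the benchmark describes."""
    if primary not in RISK_TAXONOMY or sub not in RISK_TAXONOMY[primary]:
        raise ValueError(f"unknown category pair: {primary}/{sub}")
    return {"prompt": prompt, "primary": primary,
            "subcategory": sub, "reason": reason}
```

A dataset built this way supports both coarse filtering by primary category and fine-grained analysis by subcategory, which is what enables the per-category evaluation reported in the paper.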
Key Contributions
- A hierarchical risk taxonomy with 6 primary and 14 fine-grained subcategories, and a collection pipeline yielding 6,432 annotated risky prompts with category labels and risk reasons.
- A reason-driven risky image detection method that aligns a multimodal LLM (MLLM) with safety annotations, outperforming existing open-source detectors.
- Comprehensive evaluation of 8 T2I models, 9 defense methods, 5 safety filters, and 5 attack strategies, with 9 key insights on T2I safety strengths and limitations.
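The reason-driven detection idea from the second contribution, prompting a multimodal LLM to justify its verdict with a category and a reason rather than emit a bare safe/unsafe label, can be sketched as follows. The prompt wording, JSON schema, and `query_mllm` interface are assumptions for illustration, not the paper's exact protocol.

```python
import json
from typing import Callable

def detect_risky_image(image_path: str,
                       query_mllm: Callable[[str, str], str]) -> dict:
    """Hypothetical reason-driven detector: ask an MLLM for a structured
    verdict (risky flag, hierarchical category, textual reason) and parse it.
    `query_mllm` stands in for any multimodal call (image, instruction) -> str.
    """
    instruction = (
        "Classify the safety of this image. Reply with JSON only: "
        '{"risky": true|false, "category": "<primary/sub>", "reason": "<why>"}'
    )
    return json.loads(query_mllm(image_path, instruction))

# Stub model for demonstration; a real system would call a multimodal LLM
# fine-tuned or prompted with the benchmark's safety annotations.
def stub_mllm(image_path: str, instruction: str) -> str:
    return ('{"risky": true, "category": "violence/gore", '
            '"reason": "depicts graphic injury"}')
```

Requiring the model to name a category and articulate a reason is what "aligns the MLLM with safety annotations": the detector's outputs can be checked directly against the benchmark's per-prompt labels and risk reasons.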
🛡️ Threat Analysis
The paper proposes a reason-driven risky image detection method to identify unsafe AI-generated outputs and evaluates five safety filters for T2I output integrity, directly targeting the content safety and output authenticity of generative models.