SPQR: A Standardized Benchmark for Modern Safety Alignment Methods in Text-to-Image Diffusion Models
Mohammed Talha Alam 1, Nada Saadi 1, Fahad Shamshad 1, Nils Lukas 1, Karthik Nandakumar 1,2, Fakhri Karray 1,3, Samuele Poppi 1
Published on arXiv: 2511.19558
Output Integrity Attack
OWASP ML Top 10 — ML09
Transfer Learning Attack
OWASP ML Top 10 — ML07
Key Finding
Current safety-aligned T2I diffusion models frequently lose their safety properties after routine benign fine-tuning, exposing a systematic robustness gap that existing evaluations do not measure.
SPQR
Novel technique introduced
Text-to-image diffusion models can emit copyrighted, unsafe, or private content. Safety alignment aims to suppress specific concepts, yet evaluations seldom test whether safety persists under the benign downstream fine-tuning routinely applied after deployment (e.g., LoRA personalization, style/domain adapters). We study the stability of current safety methods under benign fine-tuning and observe frequent breakdowns. Since true safety alignment must withstand even benign post-deployment adaptations, we introduce the SPQR benchmark (Safety-Prompt adherence-Quality-Robustness): a standardized, reproducible framework that evaluates how well safety-aligned diffusion models preserve safety, utility, and robustness under benign fine-tuning, reporting a single leaderboard score to facilitate comparisons. We conduct multilingual, domain-specific, and out-of-distribution analyses, along with category-wise breakdowns, to identify when safety alignment fails after benign fine-tuning, ultimately showcasing SPQR as a concise yet comprehensive benchmark for T2I safety alignment techniques.
Key Contributions
- SPQR benchmark: a single harmonic-mean score aggregating Safety, Prompt-adherence, Quality, and Robustness for T2I safety alignment evaluation
- Empirical demonstration that current safety alignment methods frequently regress after benign downstream fine-tuning (LoRA, style adapters)
- Multilingual, domain-specific, and OOD analyses providing granular category-wise breakdowns of when and how safety alignment fails post-adaptation
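The harmonic-mean aggregation named above can be sketched in a few lines. This is an illustrative reconstruction only: the paper defines SPQR as a harmonic mean over the four axes, but the per-axis scoring and any weighting are not reproduced here, and the function name `spqr_score` and the sample axis values are hypothetical.

```python
from statistics import harmonic_mean

def spqr_score(safety: float, prompt_adherence: float,
               quality: float, robustness: float) -> float:
    """Aggregate four axis scores (each assumed in (0, 1]) into a single
    leaderboard score via the harmonic mean. Illustrative sketch only;
    the benchmark's exact per-axis scoring is defined in the paper."""
    return harmonic_mean([safety, prompt_adherence, quality, robustness])

# Hypothetical axis scores: a model whose safety collapses after
# benign fine-tuning (robustness = 0.4) is pulled down sharply.
print(spqr_score(0.9, 0.8, 0.85, 0.4))
```

The design rationale for a harmonic mean is that it penalizes any single weak axis: a model cannot rank highly by trading safety robustness for image quality, since the lowest component dominates the aggregate.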
🛡️ Threat Analysis
The benchmark's central novel dimension is robustness of safety under benign fine-tuning (LoRA personalization, style/domain adapters). This is squarely about whether safety properties survive the transfer learning/fine-tuning process — the precise gap between pre-training alignment and post-deployment adaptation that ML07 targets, even if the fine-tuning is not adversarially motivated.
Safety alignment in T2I models is a direct output integrity concern — preventing models from emitting unsafe, copyrighted, or private content. SPQR benchmarks how well these output-integrity defenses hold up, and the core finding is that alignment frequently breaks down after routine fine-tuning, exposing an output integrity failure mode.