Identifying Models Behind Text-to-Image Leaderboards

Ali Naseh 1, Yuefeng Peng 2, Anshuman Suri 1, Harsh Chaudhari 2, Alina Oprea 2, Amir Houmansadr 1

0 citations · 59 references · arXiv

Published on arXiv

2601.09647

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Centroid-based classification in image embedding space achieves high deanonymization accuracy across 22 T2I models, with certain prompts enabling near-perfect model distinguishability, exposing a fundamental anonymization failure in T2I leaderboards.

Centroid-based T2I model deanonymization

Novel technique introduced


Text-to-image (T2I) models are increasingly popular, producing a large share of AI-generated images online. To compare model quality, voting-based leaderboards have become the standard, relying on anonymized model outputs for fairness. In this work, we show that such anonymity can be easily broken. We find that generations from each T2I model form distinctive clusters in the image embedding space, enabling accurate deanonymization without prompt control or training data. Using 22 models and 280 prompts (150K images), our centroid-based method achieves high accuracy and reveals systematic model-specific signatures. We further introduce a prompt-level distinguishability metric and conduct large-scale analyses showing how certain prompts can lead to near-perfect distinguishability. Our findings expose fundamental security flaws in T2I leaderboards and motivate stronger anonymization defenses.
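The core attack is simple: embed each model's generated images, average them into a per-model centroid, and attribute a new anonymized image to the model whose centroid is nearest. A minimal sketch of this nearest-centroid attribution, assuming embeddings are already extracted (the embedding backbone and distance choice here are illustrative, not the paper's exact configuration):

```python
import numpy as np

def fit_centroids(embeddings, labels):
    """Compute one centroid per model from its image embeddings.

    embeddings: array of shape (n_images, d); labels: list of model names.
    """
    models = sorted(set(labels))
    return {
        m: np.mean([e for e, l in zip(embeddings, labels) if l == m], axis=0)
        for m in models
    }

def attribute(embedding, centroids):
    """Assign an anonymized image embedding to the nearest model centroid
    (cosine distance, a common choice for image embedding spaces)."""
    def cos_dist(a, b):
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return min(centroids, key=lambda m: cos_dist(embedding, centroids[m]))
```

Because attribution only needs reference generations from each candidate model (obtainable via public APIs), the attacker requires neither prompt control on the leaderboard nor access to any model's training data.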


Key Contributions

  • Centroid-based deanonymization method that identifies which T2I model produced anonymized leaderboard outputs using image embedding clusters, without prompt control or training data access
  • Prompt-level distinguishability metric to quantify how much a given prompt separates model outputs, identifying prompts that yield near-perfect model attribution
  • Large-scale empirical analysis (22 models, 280 prompts, 150K images) exposing systematic model-specific signatures as a fundamental security flaw in voting-based T2I leaderboards
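One way to operationalize the prompt-level distinguishability metric is leave-one-out nearest-centroid accuracy over the images generated for a single prompt: if holding out each image and attributing it against the remaining per-model centroids yields near-perfect accuracy, that prompt cleanly separates the models. This sketch is an assumption about the metric's form, not the paper's exact definition (function name and Euclidean distance are illustrative):

```python
import numpy as np

def prompt_distinguishability(embeddings_by_model):
    """Leave-one-out nearest-centroid accuracy for one prompt.

    embeddings_by_model: dict mapping model name -> array (n_images, d)
    of embeddings of that model's generations for this prompt.
    Returns the fraction of images attributed to their true model;
    1.0 means the prompt makes the models perfectly distinguishable.
    """
    models = list(embeddings_by_model)
    correct = total = 0
    for m in models:
        X = embeddings_by_model[m]
        for i in range(len(X)):
            centroids = {}
            for m2 in models:
                Y = embeddings_by_model[m2]
                if m2 == m:
                    Y = np.delete(Y, i, axis=0)  # leave the query image out
                centroids[m2] = Y.mean(axis=0)
            pred = min(centroids, key=lambda k: np.linalg.norm(X[i] - centroids[k]))
            correct += int(pred == m)
            total += 1
    return correct / total
```

Ranking prompts by this score identifies the ones an attacker would submit to a leaderboard to deanonymize model pairs most reliably.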

🛡️ Threat Analysis

Output Integrity Attack

The paper's primary contribution is a model attribution attack: identifying which specific T2I model produced anonymized outputs using image embedding signatures. This is a content provenance attack, as it breaks the leaderboard's anonymization guarantee by exploiting natural model-specific fingerprints in generated outputs to reveal which model produced each image. This falls under output integrity / content provenance, which OWASP ML09 covers.


Details

Domains
vision, generative
Model Types
diffusion
Threat Tags
black_box, inference_time
Datasets
Custom dataset: 22 T2I models × 280 prompts = 150K images
Applications
text-to-image generation, model evaluation leaderboards, AI content attribution