Identifying Models Behind Text-to-Image Leaderboards
Ali Naseh 1, Yuefeng Peng 2, Anshuman Suri 1, Harsh Chaudhari 2, Alina Oprea 2, Amir Houmansadr 1
Published on arXiv (2601.09647)
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Centroid-based classification in image embedding space achieves high deanonymization accuracy across 22 T2I models, with certain prompts enabling near-perfect model distinguishability, exposing a fundamental anonymization failure in T2I leaderboards.
Centroid-based T2I model deanonymization
Novel technique introduced
Text-to-image (T2I) models are increasingly popular, producing a large share of AI-generated images online. To compare model quality, voting-based leaderboards have become the standard, relying on anonymized model outputs for fairness. In this work, we show that such anonymity can be easily broken. We find that generations from each T2I model form distinctive clusters in the image embedding space, enabling accurate deanonymization without prompt control or training data. Using 22 models and 280 prompts (150K images), our centroid-based method achieves high accuracy and reveals systematic model-specific signatures. We further introduce a prompt-level distinguishability metric and conduct large-scale analyses showing how certain prompts can lead to near-perfect distinguishability. Our findings expose fundamental security flaws in T2I leaderboards and motivate stronger anonymization defenses.
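The summary above does not spell out the classifier's exact formulation, but a nearest-centroid attributor in embedding space can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes images have already been mapped to vectors by some embedding model (e.g. a CLIP-style encoder, an assumption here) and uses cosine similarity to the per-model centroid.

```python
import numpy as np

def fit_centroids(embeddings_by_model):
    """Compute one L2-normalized centroid per model from its image embeddings.

    embeddings_by_model: dict mapping model name -> (n_images, dim) array.
    """
    centroids = {}
    for model, embs in embeddings_by_model.items():
        c = np.asarray(embs).mean(axis=0)
        centroids[model] = c / np.linalg.norm(c)
    return centroids

def attribute(query_emb, centroids):
    """Deanonymize one image: return the model whose centroid is most
    cosine-similar to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    return max(centroids, key=lambda m: float(q @ centroids[m]))
```

Because the attack only needs a pool of outputs per candidate model to average, it requires neither prompt control nor access to any model's training data, matching the threat model described above.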
Key Contributions
- Centroid-based deanonymization method that identifies which T2I model produced anonymized leaderboard outputs using image embedding clusters, without prompt control or training data access
- Prompt-level distinguishability metric to quantify how much a given prompt separates model outputs, identifying prompts that yield near-perfect model attribution
- Large-scale empirical analysis (22 models, 280 prompts, 150K images) exposing systematic model-specific signatures as a fundamental security flaw in voting-based T2I leaderboards
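The prompt-level distinguishability metric is named but not defined in this summary. One plausible instantiation (an assumption, not the paper's definition) is leave-one-out nearest-centroid accuracy restricted to a single prompt's images: a score of 1.0 means every image generated for that prompt is correctly attributed, i.e. the prompt yields near-perfect model separation.

```python
import numpy as np

def prompt_distinguishability(embs_by_model):
    """Hypothetical prompt-level metric: leave-one-out nearest-centroid
    accuracy over all images generated for one prompt.

    embs_by_model: dict mapping model name -> (n_images, dim) array of
    embeddings of that model's outputs for this prompt.
    Returns a score in [0, 1]; 1.0 = perfectly distinguishable.
    """
    models = list(embs_by_model)
    correct = total = 0
    for m in models:
        embs = np.asarray(embs_by_model[m])
        for i in range(len(embs)):
            # Refit centroids with the query image held out of its own model's pool.
            cents = {}
            for m2 in models:
                pool = np.asarray(embs_by_model[m2])
                if m2 == m:
                    pool = np.delete(pool, i, axis=0)
                c = pool.mean(axis=0)
                cents[m2] = c / np.linalg.norm(c)
            q = embs[i] / np.linalg.norm(embs[i])
            pred = max(cents, key=lambda k: float(q @ cents[k]))
            correct += int(pred == m)
            total += 1
    return correct / total
```

Ranking prompts by such a score would surface the "near-perfect distinguishability" prompts the paper reports, which are exactly the ones an adversary would vote on to deanonymize a leaderboard match.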
🛡️ Threat Analysis
The paper's primary contribution is a model attribution attack: identifying which specific T2I model produced anonymized outputs from their image embedding signatures. It is a content provenance attack that breaks the leaderboard's anonymization guarantee by exploiting natural model-specific fingerprints in generated images, mapping each anonymized output back to its source model. This places it under output integrity and content provenance, the scope of ML09.