attack · arXiv · Oct 7, 2025
Ali Naseh, Anshuman Suri, Yuefeng Peng et al. · University of Massachusetts Amherst · Northeastern University
Deanonymizes text-to-image leaderboard models via CLIP embedding signatures, enabling rank manipulation attacks with near-perfect accuracy
Output Integrity Attack · vision · generative
Generative AI leaderboards are central to evaluating model capabilities, but remain vulnerable to manipulation. Among key adversarial objectives is rank manipulation, where an attacker must first deanonymize the models behind displayed outputs -- a threat previously demonstrated and explored for large language models (LLMs). We show that this problem can be even more severe for text-to-image leaderboards, where deanonymization is markedly easier. Using over 150,000 generated images from 280 prompts and 19 diverse models spanning multiple organizations, architectures, and sizes, we demonstrate that simple real-time classification in CLIP embedding space identifies the generating model with high accuracy, even without prompt control or historical data. We further introduce a prompt-level separability metric and identify prompts that enable near-perfect deanonymization. Our results indicate that rank manipulation in text-to-image leaderboards is easier than previously recognized, underscoring the need for stronger defenses.
diffusion · transformer
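The attack described in the abstract is simple enough to sketch. Below is a minimal illustration of nearest-centroid model identification in CLIP embedding space; the choice of library (open_clip), the ViT-B-32 backbone, and the cosine nearest-centroid rule are assumptions for illustration, not necessarily the paper's exact setup:

```python
# Sketch of nearest-centroid model attribution in CLIP embedding space.
# Assumptions (not confirmed by the paper): open_clip with ViT-B-32/openai
# weights, and a cosine-similarity nearest-centroid decision rule.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")
model.eval()

def embed(image_path: str) -> torch.Tensor:
    """Return the L2-normalized CLIP image embedding, shape (1, d)."""
    img = preprocess(Image.open(image_path)).unsqueeze(0)
    with torch.no_grad():
        feat = model.encode_image(img)
    return feat / feat.norm(dim=-1, keepdim=True)

def build_centroids(images_by_model: dict[str, list[str]]) -> dict[str, torch.Tensor]:
    """One unit-norm centroid per T2I model, from reference generations."""
    centroids = {}
    for name, paths in images_by_model.items():
        feats = torch.cat([embed(p) for p in paths])  # (n, d)
        c = feats.mean(dim=0)
        centroids[name] = c / c.norm()
    return centroids

def identify(image_path: str, centroids: dict[str, torch.Tensor]) -> str:
    """Attribute an anonymous output to the model with the closest centroid."""
    q = embed(image_path).squeeze(0)
    return max(centroids, key=lambda name: torch.dot(q, centroids[name]).item())
```

In this framing, an attacker would build centroids once from a handful of reference generations per model, then classify anonymized leaderboard outputs in real time, which is consistent with the abstract's claim that no prompt control is needed.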
attack · arXiv · Jan 14, 2026
Ali Naseh, Yuefeng Peng, Anshuman Suri et al. · University of Massachusetts Amherst · Northeastern University
Attacks T2I leaderboard anonymity by clustering model outputs in embedding space, deanonymizing 22 models from 150K images
Output Integrity Attack · vision · generative
Text-to-image (T2I) models are increasingly popular, producing a large share of AI-generated images online. To compare model quality, voting-based leaderboards have become the standard, relying on anonymized model outputs for fairness. In this work, we show that such anonymity can be easily broken. We find that generations from each T2I model form distinctive clusters in the image embedding space, enabling accurate deanonymization without prompt control or training data. Using 22 models and 280 prompts (150K images), our centroid-based method achieves high accuracy and reveals systematic model-specific signatures. We further introduce a prompt-level distinguishability metric and conduct large-scale analyses showing how certain prompts can lead to near-perfect distinguishability. Our findings expose fundamental security flaws in T2I leaderboards and motivate stronger anonymization defenses.
diffusion
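The abstract names a prompt-level distinguishability metric but does not define it. As an illustrative stand-in only, the sketch below scores a prompt by how cleanly its generations cluster per model, using scikit-learn's silhouette score over CLIP embeddings (the silhouette choice and the cosine metric are both assumptions):

```python
# Sketch of a prompt-level distinguishability score. The paper's exact metric
# is not given in the abstract; as a stand-in, this uses the silhouette score
# of CLIP embeddings labeled by generating model. Prompts whose generations
# cluster tightly per model (score near 1) should be easiest to deanonymize.
import numpy as np
from sklearn.metrics import silhouette_score

def prompt_distinguishability(embeddings: np.ndarray,
                              model_labels: np.ndarray) -> float:
    """embeddings: (n_images, d) CLIP features for one prompt across models;
    model_labels: (n_images,) integer id of the generating model."""
    return silhouette_score(embeddings, model_labels, metric="cosine")
```

Ranking prompts by such a score would surface the "near-perfect distinguishability" prompts the abstract mentions, i.e. the ones an attacker would submit to deanonymize models fastest.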