Text-to-Image Models Leave Identifiable Signatures: Implications for Leaderboard Security
Ali Naseh¹, Anshuman Suri², Yuefeng Peng¹, Harsh Chaudhari², Alina Oprea², Amir Houmansadr¹
Published on arXiv (arXiv:2510.06525)
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
A simple classifier operating in CLIP embedding space identifies the generating model with high accuracy across 19 models; some models reach 99% top-1 accuracy even without access to competing models' outputs.
CLIP-based model deanonymization with prompt-level separability metric
Novel technique introduced
Generative AI leaderboards are central to evaluating model capabilities, but remain vulnerable to manipulation. Among key adversarial objectives is rank manipulation, where an attacker must first deanonymize the models behind displayed outputs -- a threat previously demonstrated and explored for large language models (LLMs). We show that this problem can be even more severe for text-to-image leaderboards, where deanonymization is markedly easier. Using over 150,000 generated images from 280 prompts and 19 diverse models spanning multiple organizations, architectures, and sizes, we demonstrate that simple real-time classification in CLIP embedding space identifies the generating model with high accuracy, even without prompt control or historical data. We further introduce a prompt-level separability metric and identify prompts that enable near-perfect deanonymization. Our results indicate that rank manipulation in text-to-image leaderboards is easier than previously recognized, underscoring the need for stronger defenses.
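The attack described in the abstract reduces to a standard classification problem over image embeddings. The sketch below illustrates the idea with a nearest-centroid classifier; it is a minimal illustration, not the paper's exact pipeline. In the real attack each image would be encoded with a CLIP image encoder (e.g. via the open_clip library); here synthetic stand-in vectors with a small per-model offset replace the CLIP embeddings so the snippet is self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for CLIP image embeddings: each "model" gets its own signature
# direction in a 512-d space, plus noise. In the real attack these vectors
# would come from a CLIP image encoder applied to generated images.
N_MODELS, DIM, PER_MODEL = 4, 512, 50
centers = rng.normal(size=(N_MODELS, DIM))

def fake_embed(model_id, n):
    """Hypothetical embeddings of n images from one model (illustrative only)."""
    return centers[model_id] + 0.5 * rng.normal(size=(n, DIM))

# Labeled "training" images from each candidate model on the leaderboard.
train_X = np.vstack([fake_embed(m, PER_MODEL) for m in range(N_MODELS)])
train_y = np.repeat(np.arange(N_MODELS), PER_MODEL)

# Nearest-centroid attribution in embedding space (one simple classifier
# choice; the paper may use a different one).
centroids = np.stack([train_X[train_y == m].mean(axis=0)
                      for m in range(N_MODELS)])
centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

def attribute(embedding):
    """Deanonymize one output: return the model with the closest centroid."""
    e = embedding / np.linalg.norm(embedding)
    return int(np.argmax(centroids @ e))  # highest cosine similarity

# Attribute a fresh output generated by model 2.
query = fake_embed(2, 1)[0]
print(attribute(query))
```

Because the per-model signatures here are well separated relative to the noise, attribution is essentially perfect, mirroring the paper's finding that real text-to-image models are highly separable in CLIP space.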
Key Contributions
- Demonstrates that text-to-image models leave strong identifiable signatures in CLIP embedding space, enabling high-accuracy model deanonymization across 19 diverse models without prompt control or historical data
- Introduces a prompt-level separability metric that predicts deanonymization accuracy and identifies near-perfect attack prompts
- Shows model deanonymization succeeds even in the most restrictive one-vs-rest setting (e.g., SDXL Turbo at 99% top-1 accuracy), underscoring leaderboard rank-manipulation risk
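The prompt-level separability metric from the contributions above can be approximated by a simple ratio of between-model to within-model spread in embedding space. This is one plausible formulation, not the paper's exact definition: a prompt whose images cluster tightly per model, with model clusters far apart, should enable near-perfect deanonymization.

```python
import numpy as np

def prompt_separability(embeddings, labels):
    """Separability of one prompt's images in embedding space: mean pairwise
    distance between model centroids, divided by mean distance of each image
    to its own model's centroid. Higher values mean the models are easier to
    tell apart on this prompt. (Illustrative formulation, not the paper's.)"""
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    models = np.unique(labels)
    centroids = np.stack([embeddings[labels == m].mean(axis=0)
                          for m in models])
    # Between-model signal: mean pairwise centroid distance.
    diffs = centroids[:, None, :] - centroids[None, :, :]
    pairwise = np.linalg.norm(diffs, axis=-1)
    inter = pairwise[np.triu_indices(len(models), k=1)].mean()
    # Within-model noise: mean distance to own centroid.
    intra = np.mean([
        np.linalg.norm(embeddings[labels == m] - centroids[i], axis=1).mean()
        for i, m in enumerate(models)
    ])
    return inter / intra

# Toy usage: two "models" with tight, distant clusters of 2-d embeddings,
# i.e. a prompt on which the models are highly separable.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
y = np.array([0, 0, 1, 1])
print(prompt_separability(X, y))
```

Ranking prompts by such a score, then querying the leaderboard only with high-scoring prompts, is the natural way an attacker would operationalize the metric.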
🛡️ Threat Analysis
The paper exploits identifiable model-specific signatures in generated images, recovered via CLIP embeddings, to attribute outputs to their source model, directly attacking the content-provenance and output-anonymity assumptions of leaderboard systems. This is an output-integrity attack: the adversary recovers which model produced a given output, undermining the integrity of AI evaluation pipelines. The paper also introduces a prompt-level separability metric to operationalize the attack, and calls for stronger defenses against such output-level attribution.