Text-to-Image Models Leave Identifiable Signatures: Implications for Leaderboard Security

Ali Naseh 1, Anshuman Suri 2, Yuefeng Peng 1, Harsh Chaudhari 2, Alina Oprea 2, Amir Houmansadr 1

0 citations · 27 references · arXiv

Published on arXiv: 2510.06525

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Simple CLIP embedding space classification identifies the generating model with high accuracy across 19 models, with some models reaching 99% top-1 accuracy even without access to competing models' outputs.

CLIP-based model deanonymization with prompt-level separability metric

Novel technique introduced


Generative AI leaderboards are central to evaluating model capabilities, but remain vulnerable to manipulation. Among key adversarial objectives is rank manipulation, where an attacker must first deanonymize the models behind displayed outputs -- a threat previously demonstrated and explored for large language models (LLMs). We show that this problem can be even more severe for text-to-image leaderboards, where deanonymization is markedly easier. Using over 150,000 generated images from 280 prompts and 19 diverse models spanning multiple organizations, architectures, and sizes, we demonstrate that simple real-time classification in CLIP embedding space identifies the generating model with high accuracy, even without prompt control or historical data. We further introduce a prompt-level separability metric and identify prompts that enable near-perfect deanonymization. Our results indicate that rank manipulation in text-to-image leaderboards is easier than previously recognized, underscoring the need for stronger defenses.
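The paper reports that simple real-time classification in CLIP embedding space suffices to identify the generating model. A minimal sketch of this style of attack is below; it uses synthetic vectors as stand-ins for CLIP image embeddings (the paper's actual classifier and features are not reproduced here) and attributes a query embedding to the model whose embedding centroid is nearest in cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for CLIP image embeddings (512-d). Each hypothetical
# model leaves a subtle per-model "signature" offset in its outputs.
n_models, dim, per_model = 3, 512, 100
signatures = rng.normal(0.0, 1.0, (n_models, dim))
train = np.concatenate(
    [sig + rng.normal(0.0, 0.5, (per_model, dim)) for sig in signatures]
)
labels = np.repeat(np.arange(n_models), per_model)

def normalize(x):
    # L2-normalize so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# One centroid per model: the mean embedding of its training images.
centroids = normalize(
    np.stack([train[labels == m].mean(axis=0) for m in range(n_models)])
)

def attribute(embedding):
    # Deanonymize: return the index of the model whose centroid is
    # closest (highest cosine similarity) to the query embedding.
    sims = normalize(embedding) @ centroids.T
    return int(np.argmax(sims))

# A fresh image from model 1 is attributed back to model 1.
query = signatures[1] + rng.normal(0.0, 0.5, dim)
print(attribute(query))
```

In the real attack the centroids (or a trained classifier) would be fit on CLIP embeddings of images collected from each candidate model, and the query would be a leaderboard output to deanonymize.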


Key Contributions

  • Demonstrates that text-to-image models leave strong identifiable signatures in CLIP embedding space, enabling high-accuracy model deanonymization across 19 diverse models without prompt control or historical data
  • Introduces a prompt-level separability metric that predicts deanonymization accuracy and identifies near-perfect attack prompts
  • Shows model deanonymization succeeds even in the most restrictive one-vs-rest setting (e.g., SDXL Turbo at 99% top-1 accuracy), underscoring leaderboard rank-manipulation risk
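The paper's prompt-level separability metric predicts which prompts make deanonymization easy. Its exact definition is not reproduced here; one plausible instantiation, sketched below on synthetic embeddings, scores a prompt by the ratio of between-model centroid spread to within-model spread, so prompts whose outputs cluster tightly by source model score higher.

```python
import numpy as np

def prompt_separability(embeddings, labels):
    """Illustrative separability score for a single prompt.

    Ratio of between-model spread (distance of each model's centroid
    from the grand centroid) to within-model spread (distance of each
    image embedding from its model's centroid). Higher means the
    models' outputs for this prompt are easier to tell apart.
    """
    labels = np.asarray(labels)
    models = np.unique(labels)
    centroids = np.stack([embeddings[labels == m].mean(axis=0) for m in models])
    grand = centroids.mean(axis=0)
    between = np.mean(np.linalg.norm(centroids - grand, axis=1))
    within = np.mean(
        [np.linalg.norm(embeddings[labels == m] - c, axis=1).mean()
         for m, c in zip(models, centroids)]
    )
    return between / (within + 1e-12)

rng = np.random.default_rng(1)
# Two synthetic prompts: one where three models' outputs separate
# cleanly (low noise), one where they heavily overlap (high noise).
sig = rng.normal(0.0, 1.0, (3, 64))
easy_prompt = np.concatenate([s + rng.normal(0.0, 0.2, (50, 64)) for s in sig])
hard_prompt = np.concatenate([s + rng.normal(0.0, 5.0, (50, 64)) for s in sig])
lab = np.repeat(np.arange(3), 50)

print(prompt_separability(easy_prompt, lab) > prompt_separability(hard_prompt, lab))
```

An attacker could rank candidate prompts by such a score offline and submit only the highest-scoring ones to a leaderboard, concentrating queries on prompts that yield near-perfect attribution.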

🛡️ Threat Analysis

Output Integrity Attack

The paper exploits identifiable, model-specific signatures in generated image outputs, recovered by classifying the outputs in CLIP embedding space, to attribute each image to its source model. This directly attacks the content-provenance and output-anonymity assumptions of leaderboard systems: the adversary learns which model produced a given output, undermining the integrity of AI evaluation pipelines. The paper also introduces a prompt-level separability metric to operationalize the attack, and calls for stronger defenses against such output-level attribution.


Details

Domains
vision, generative
Model Types
diffusion, transformer
Threat Tags
black_box, inference_time
Datasets
150,000+ generated images across 280 prompts and 19 text-to-image models
Applications
text-to-image leaderboards, generative AI evaluation systems