Text-to-Image Models Leave Identifiable Signatures: Implications for Leaderboard Security
Ali Naseh¹, Anshuman Suri², Yuefeng Peng¹, Harsh Chaudhari², Alina Oprea², Amir Houmansadr¹
Published on arXiv (arXiv:2510.06525)
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
A simple classifier operating in CLIP embedding space identifies the generating model with high accuracy across 19 models; some models reach 99% top-1 accuracy even without access to competing models' outputs.
CLIP-based model deanonymization with prompt-level separability metric
Novel technique introduced
Generative AI leaderboards are central to evaluating model capabilities, but remain vulnerable to manipulation. Among key adversarial objectives is rank manipulation, where an attacker must first deanonymize the models behind displayed outputs -- a threat previously demonstrated and explored for large language models (LLMs). We show that this problem can be even more severe for text-to-image leaderboards, where deanonymization is markedly easier. Using over 150,000 generated images from 280 prompts and 19 diverse models spanning multiple organizations, architectures, and sizes, we demonstrate that simple real-time classification in CLIP embedding space identifies the generating model with high accuracy, even without prompt control or historical data. We further introduce a prompt-level separability metric and identify prompts that enable near-perfect deanonymization. Our results indicate that rank manipulation in text-to-image leaderboards is easier than previously recognized, underscoring the need for stronger defenses.
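The attack described in the abstract reduces to a standard classification problem over image embeddings. The sketch below illustrates the idea with a nearest-centroid classifier; it is a minimal illustration, not the paper's exact pipeline. In the real attack each image would be encoded with a CLIP image encoder (e.g. via the open_clip library); here synthetic stand-in vectors with a small per-model offset replace the CLIP embeddings so the snippet is self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for CLIP image embeddings: each "model" gets its own signature
# direction in a 512-d space, plus noise. In the real attack these vectors
# would come from a CLIP image encoder applied to generated images.
N_MODELS, DIM, PER_MODEL = 4, 512, 50
centers = rng.normal(size=(N_MODELS, DIM))

def fake_embed(model_id, n):
    """Hypothetical embeddings of n images from one model (illustrative only)."""
    return centers[model_id] + 0.5 * rng.normal(size=(n, DIM))

# Labeled "training" images from each candidate model on the leaderboard.
train_X = np.vstack([fake_embed(m, PER_MODEL) for m in range(N_MODELS)])
train_y = np.repeat(np.arange(N_MODELS), PER_MODEL)

# Nearest-centroid attribution in embedding space (one simple classifier
# choice; the paper may use a different one).
centroids = np.stack([train_X[train_y == m].mean(axis=0)
                      for m in range(N_MODELS)])
centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

def attribute(embedding):
    """Deanonymize one output: return the model with the closest centroid."""
    e = embedding / np.linalg.norm(embedding)
    return int(np.argmax(centroids @ e))  # highest cosine similarity

# Attribute a fresh output generated by model 2.
query = fake_embed(2, 1)[0]
print(attribute(query))
```

Because the per-model signatures here are well separated relative to the noise, attribution is essentially perfect, mirroring the paper's finding that real text-to-image models are highly separable in CLIP space.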
Key Contributions
- Demonstrates that text-to-image models leave strong identifiable signatures in CLIP embedding space, enabling high-accuracy model deanonymization across 19 diverse models without prompt control or historical data
- Introduces a prompt-level separability metric that predicts deanonymization accuracy and identifies near-perfect attack prompts
- Shows model deanonymization succeeds even in the most restrictive one-vs-rest setting (e.g., SDXL Turbo at 99% top-1 accuracy), underscoring leaderboard rank-manipulation risk
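The prompt-level separability metric from the contributions above can be approximated by a simple ratio of between-model to within-model spread in embedding space. This is one plausible formulation, not the paper's exact definition: a prompt whose images cluster tightly per model, with model clusters far apart, should enable near-perfect deanonymization.

```python
import numpy as np

def prompt_separability(embeddings, labels):
    """Separability of one prompt's images in embedding space: mean pairwise
    distance between model centroids, divided by mean distance of each image
    to its own model's centroid. Higher values mean the models are easier to
    tell apart on this prompt. (Illustrative formulation, not the paper's.)"""
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    models = np.unique(labels)
    centroids = np.stack([embeddings[labels == m].mean(axis=0)
                          for m in models])
    # Between-model signal: mean pairwise centroid distance.
    diffs = centroids[:, None, :] - centroids[None, :, :]
    pairwise = np.linalg.norm(diffs, axis=-1)
    inter = pairwise[np.triu_indices(len(models), k=1)].mean()
    # Within-model noise: mean distance to own centroid.
    intra = np.mean([
        np.linalg.norm(embeddings[labels == m] - centroids[i], axis=1).mean()
        for i, m in enumerate(models)
    ])
    return inter / intra

# Toy usage: two "models" with tight, distant clusters of 2-d embeddings,
# i.e. a prompt on which the models are highly separable.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
y = np.array([0, 0, 1, 1])
print(prompt_separability(X, y))
```

Ranking prompts by such a score, then querying the leaderboard only with high-scoring prompts, is the natural way an attacker would operationalize the metric.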
🛡️ Threat Analysis
The paper exploits identifiable model-specific signatures in generated images, recovered via CLIP embeddings, to attribute outputs to their source model, directly attacking the content-provenance and output-anonymity assumptions of leaderboard systems. This is an output-integrity attack: the adversary recovers which model produced a given output, undermining the integrity of AI evaluation pipelines. The paper also introduces a prompt-level separability metric to operationalize the attack, and calls for stronger defenses against such output-level attribution.