attack arXiv Oct 7, 2025 · Oct 2025
Ali Naseh, Anshuman Suri, Yuefeng Peng et al. · University of Massachusetts Amherst · Northeastern University
Deanonymizes text-to-image leaderboard models via CLIP embedding signatures, enabling rank manipulation attacks with near-perfect accuracy
Output Integrity Attack visiongenerative
Generative AI leaderboards are central to evaluating model capabilities, but remain vulnerable to manipulation. Among key adversarial objectives is rank manipulation, where an attacker must first deanonymize the models behind displayed outputs -- a threat previously demonstrated and explored for large language models (LLMs). We show that this problem can be even more severe for text-to-image leaderboards, where deanonymization is markedly easier. Using over 150,000 generated images from 280 prompts and 19 diverse models spanning multiple organizations, architectures, and sizes, we demonstrate that simple real-time classification in CLIP embedding space identifies the generating model with high accuracy, even without prompt control or historical data. We further introduce a prompt-level separability metric and identify prompts that enable near-perfect deanonymization. Our results indicate that rank manipulation in text-to-image leaderboards is easier than previously recognized, underscoring the need for stronger defenses.
diffusion transformer University of Massachusetts Amherst · Northeastern University
attack arXiv Jan 14, 2026 · 11w ago
Ali Naseh, Yuefeng Peng, Anshuman Suri et al. · University of Massachusetts Amherst · Northeastern University
Attacks T2I leaderboard anonymity by clustering model outputs in embedding space, deanonymizing 22 models from 150K images
Output Integrity Attack visiongenerative
Text-to-image (T2I) models are increasingly popular, producing a large share of AI-generated images online. To compare model quality, voting-based leaderboards have become the standard, relying on anonymized model outputs for fairness. In this work, we show that such anonymity can be easily broken. We find that generations from each T2I model form distinctive clusters in the image embedding space, enabling accurate deanonymization without prompt control or training data. Using 22 models and 280 prompts (150K images), our centroid-based method achieves high accuracy and reveals systematic model-specific signatures. We further introduce a prompt-level distinguishability metric and conduct large-scale analyses showing how certain prompts can lead to near-perfect distinguishability. Our findings expose fundamental security flaws in T2I leaderboards and motivate stronger anonymization defenses.
diffusion University of Massachusetts Amherst · Northeastern University
attack arXiv Jan 27, 2026 · 9w ago
Harsh Chaudhari, Ethan Rathbun, Hanna Foerster et al. · Northeastern University · University of Cambridge +4 more
Poisons LLM CoT training data by corrupting reasoning traces to inject targeted behaviors into unseen domains without altering queries or answers
Data Poisoning Attack Training Data Poisoning nlp
Chain-of-Thought (CoT) reasoning has emerged as a powerful technique for enhancing large language models' capabilities by generating intermediate reasoning steps for complex tasks. A common practice for equipping LLMs with reasoning is to fine-tune pre-trained models using CoT datasets from public repositories like HuggingFace, which creates new attack vectors targeting the reasoning traces themselves. While prior works have shown the possibility of mounting backdoor attacks in CoT-based models, these attacks require explicit inclusion of triggered queries with flawed reasoning and incorrect answers in the training set to succeed. Our work unveils a new class of Indirect Targeted Poisoning attacks in reasoning models that manipulate responses of a target task by transferring CoT traces learned from a different task. Our "Thought-Transfer" attack can influence the LLM output on a target task by manipulating only the training samples' CoT traces, while leaving the queries and answers unchanged, resulting in a form of ``clean label'' poisoning. Unlike prior targeted poisoning attacks that explicitly require target task samples in the poisoned data, we demonstrate that thought-transfer achieves 70% success rates in injecting targeted behaviors into entirely different domains that are never present in training. Training on poisoned reasoning data also improves the model's performance by 10-15% on multiple benchmarks, providing incentives for a user to use our poisoned reasoning dataset. Our findings reveal a novel threat vector enabled by reasoning models, which is not easily defended by existing mitigations.
llm transformer Northeastern University · University of Cambridge · Google DeepMind +3 more