defense 2026

Unknown Aware AI-Generated Content Attribution

Ellie Thieu, Jifan Zhang, Haoyue Bai

0 citations · 59 references · arXiv

Published on arXiv · 2601.00218

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Incorporating unlabeled wild Internet data via constrained optimization substantially improves attribution of DALL·E 3 images against challenging unseen generators including Midjourney, Firefly, and Stable Diffusion XL.

Unknown-Aware Constrained Attribution

Novel technique introduced


The rapid advancement of photorealistic generative models has made it increasingly important to attribute the origin of synthetic content, moving beyond binary real-or-fake detection toward identifying the specific model that produced a given image. We study the problem of distinguishing outputs from a target generative model (e.g., OpenAI DALL·E 3) from other sources, including real images and images generated by a wide range of alternative models. Using CLIP features and a simple linear classifier, shown to be effective in prior work, we establish a strong baseline for target-generator attribution using only limited labeled data from the target model and a small number of known generators. However, this baseline struggles to generalize to harder, unseen, and newly released generators. To address this limitation, we propose a constrained optimization approach that leverages unlabeled wild data: images collected from the Internet that may include real images, outputs from unknown generators, or even samples from the target model itself. The proposed method encourages wild samples to be classified as non-target while explicitly constraining performance on labeled data to remain high. Experimental results show that incorporating wild data substantially improves attribution performance on challenging unseen generators, demonstrating that unlabeled data from the wild can be effectively exploited to enhance AI-generated content attribution in open-world settings.
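The baseline described above is a linear probe on frozen CLIP embeddings. The sketch below illustrates that setup with synthetic stand-in features (the paper uses real CLIP embeddings of images; the dimensions, data, and classifier settings here are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for CLIP image embeddings; in the actual pipeline these would
# come from a frozen CLIP encoder applied to real and generated images.
DIM = 512
target_feats = rng.normal(loc=0.5, scale=1.0, size=(200, DIM))   # target model (e.g., DALL·E 3)
other_feats = rng.normal(loc=-0.5, scale=1.0, size=(200, DIM))   # real images + known generators

X = np.vstack([target_feats, other_feats])
y = np.array([1] * 200 + [0] * 200)  # 1 = target generator, 0 = non-target

# Simple linear probe on frozen features, mirroring the baseline.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(f"train accuracy: {clf.score(X, y):.2f}")
```

The point of the baseline is that attribution signal already lives in the embedding space, so only a linear decision boundary needs to be learned from the limited labeled data.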


Key Contributions

  • Constrained optimization framework that fine-tunes a CLIP-based classifier using unlabeled wild Internet images while preserving labeled in-distribution performance, preventing catastrophic forgetting.
  • Open-world attribution formulation for identifying target generator outputs (e.g., DALL·E 3) in the presence of unknown or newly released generators not seen during training.
  • Empirical demonstration that unlabeled wild data substantially improves attribution against unseen generators such as Midjourney, Firefly, and Stable Diffusion XL.
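The constrained objective in the first contribution can be sketched with a simple penalty method: minimize the score assigned to unlabeled wild samples as "target," subject to the labeled classification loss staying below a tolerance. Everything below is an illustrative assumption (synthetic features in place of CLIP embeddings, a hand-rolled logistic model, and a hinge-style penalty standing in for whatever constrained solver the paper actually uses):

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 64

# Synthetic stand-ins for CLIP features; all data here is illustrative.
X_lab = rng.normal(size=(200, DIM))
y_lab = (rng.random(200) < 0.5).astype(float)        # 1 = target model, 0 = non-target
X_lab += np.where(y_lab[:, None] == 1, 0.8, -0.8)    # make the labeled classes separable
X_wild = rng.normal(size=(300, DIM)) - 0.3           # unlabeled wild Internet images

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def bce(w, X, y):
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

# Penalty-method version of the constrained objective:
#   minimize   loss pushing wild samples toward the non-target class
#   subject to labeled loss staying below a tolerance tau
tau, lam, lr = 0.3, 10.0, 0.1
w = np.zeros(DIM)
for _ in range(500):
    # gradient of mean softplus(logit) on wild data: drives wild logits negative
    p_wild = sigmoid(X_wild @ w)
    g_wild = X_wild.T @ p_wild / len(X_wild)
    # gradient of the labeled cross-entropy, applied only when the constraint is violated
    p_lab = sigmoid(X_lab @ w)
    g_lab = X_lab.T @ (p_lab - y_lab) / len(X_lab)
    penalty_active = bce(w, X_lab, y_lab) > tau
    w -= lr * (g_wild + (lam * g_lab if penalty_active else 0.0))

print(f"labeled loss: {bce(w, X_lab, y_lab):.3f}")
print(f"mean wild target-score: {sigmoid(X_wild @ w).mean():.3f}")
```

The design intuition matches the summary above: the wild term pushes unknown data toward "non-target" (tightening the decision boundary around the target generator), while the constraint prevents catastrophic forgetting of the labeled in-distribution behavior.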

🛡️ Threat Analysis

Output Integrity Attack

Directly addresses AI-generated content provenance and attribution — determining which specific generative model (e.g., DALL·E 3) produced a given image. Proposes a novel constrained optimization framework for open-world content attribution, extending beyond binary real/fake detection to fine-grained source attribution, which is a core output integrity and content authenticity concern.


Details

Domains
vision, generative
Model Types
diffusion, GAN, transformer
Threat Tags
inference_time, black_box
Datasets
DALL·E 3 generated images, Midjourney, Adobe Firefly, Stable Diffusion XL, web-collected wild images
Applications
AI-generated image attribution, content provenance tracking, generative model forensics