defense 2026

Exploiting the Final Component of Generator Architectures for AI-Generated Image Detection

Yanzhu Liu, Xiao Liu, Yuexuan Wang, Soumik Mondal

0 citations · 52 references · arXiv


Published on arXiv · 2601.20461

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Detector fine-tuned on DINOv3 achieves 98.83% average accuracy across 22 unseen generator test sets using only 100 training samples each from three representative generator categories.


With the rapid proliferation of powerful image generators, accurate detection of AI-generated images has become essential for maintaining a trustworthy online environment. However, existing deepfake detectors often generalize poorly to images produced by unseen generators. Notably, despite being trained under vastly different paradigms, such as diffusion or autoregressive modeling, many modern image generators share common final architectural components that serve as the last stage for converting intermediate representations into images. Motivated by this insight, we propose to "contaminate" real images using the generator's final component and train a detector to distinguish them from the original real images. We further introduce a taxonomy based on generators' final components and categorize 21 widely used generators accordingly, enabling a comprehensive investigation of our method's generalization capability. Using only 100 samples from each of three representative categories, our detector, fine-tuned on the DINOv3 backbone, achieves an average accuracy of 98.83% across 22 testing sets from unseen generators.


Key Contributions

  • Novel training strategy that 'contaminates' real images using a generator's final architectural component (e.g., decoder, de-tokenizer) to create training signal for AI-image detectors without requiring generated images directly
  • Taxonomy of 21 widely-used image generators categorized by their final architectural components, enabling principled generalization analysis
  • DINOv3-fine-tuned detector achieving 98.83% average accuracy across 22 unseen generator test sets using only 100 training samples per category
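The contamination strategy above can be sketched in a few lines. The toy below is a minimal illustration, not the paper's implementation: `toy_final_component` is a hypothetical stand-in for a real final component (e.g., a VAE decoder's encode/decode round-trip), using lossy down/up-sampling to leave reconstruction artifacts in pixel space. Real images are labeled 0 and their contaminated round-trips 1, so no actual generated images are needed for training.

```python
import numpy as np

def toy_final_component(img: np.ndarray, factor: int = 2) -> np.ndarray:
    """Hypothetical stand-in for a generator's final component (e.g. a VAE
    decoder): round-trip the image through a lossy low-resolution "latent",
    leaving reconstruction artifacts analogous to decoder fingerprints."""
    h, w = img.shape
    # average-pool to a low-resolution "latent"
    latent = img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    # nearest-neighbour upsample back to pixel space
    return np.repeat(np.repeat(latent, factor, axis=0), factor, axis=1)

def build_training_pairs(real_images):
    """Pair each real image (label 0) with its contaminated copy (label 1)."""
    xs, ys = [], []
    for img in real_images:
        xs.append(img)
        ys.append(0)  # original real image
        xs.append(toy_final_component(img))
        ys.append(1)  # "contaminated" via the final component
    return np.stack(xs), np.array(ys)

rng = np.random.default_rng(0)
reals = [rng.random((8, 8)) for _ in range(4)]
X, y = build_training_pairs(reals)  # 8 samples: 4 real, 4 contaminated
```

In the paper's actual pipeline, the round-trip would run through the final component of a real generator and the resulting pairs would fine-tune a DINOv3-backed binary classifier; the labeling scheme is the same.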

🛡️ Threat Analysis

Output Integrity Attack

Directly addresses AI-generated content detection (deepfake/synthetic image detection), proposing a novel forensic method that leverages shared architectural components of image generators to train detectors that generalize across unseen generators — core output integrity and content authenticity work.


Details

Domains
vision · generative
Model Types
diffusion · gan · transformer
Threat Tags
inference_time · black_box
Datasets
22 unseen generator test sets (proprietary benchmark spanning diffusion, autoregressive, and GAN generators)
Applications
ai-generated image detection · deepfake detection