Benchmark · 2025

AI-Generated Image Detection: An Empirical Study and Future Research Directions

Nusrat Tasnim 1,2, Kutub Uddin 2, Khalid Mahmood Malik 2

0 citations · 59 references · arXiv


Published on arXiv: 2511.02791

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Generalization varies substantially across methods: strong in-distribution performance frequently degrades under cross-model transferability evaluation.


The threats posed by AI-generated media, particularly deepfakes, now raise significant challenges for multimedia forensics, misinformation detection, and biometric systems, resulting in erosion of public trust in the legal system, a significant increase in fraud, and social engineering attacks. Although several forensic methods have been proposed, they suffer from three critical gaps: (i) use of non-standardized benchmarks with GAN- or diffusion-generated images, (ii) inconsistent training protocols (e.g., scratch, frozen, fine-tuning), and (iii) limited evaluation metrics that fail to capture generalization and explainability. These limitations hinder fair comparison, obscure true robustness, and restrict deployment in security-critical applications. This paper introduces a unified benchmarking framework for systematic evaluation of forensic methods under controlled and reproducible conditions. We benchmark ten SoTA forensic methods (scratch, frozen, and fine-tuned) on seven publicly available datasets (GAN and diffusion) to perform extensive and systematic evaluations. We evaluate performance using multiple metrics, including accuracy, average precision, ROC-AUC, error rate, and class-wise sensitivity. We further analyze model interpretability using confidence curves and Grad-CAM heatmaps. Our evaluations demonstrate substantial variability in generalization, with certain methods exhibiting strong in-distribution performance but degraded cross-model transferability. This study aims to guide the research community toward a deeper understanding of the strengths and limitations of current forensic approaches, and to inspire the development of more robust, generalizable, and explainable solutions.
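The metric suite the abstract lists (accuracy, average precision, ROC-AUC, error rate, class-wise sensitivity) can be sketched with scikit-learn. The labels, scores, and 0.5 decision threshold below are illustrative, not the paper's actual protocol or data.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             confusion_matrix, roc_auc_score)

# Illustrative labels (0 = real, 1 = AI-generated) and detector scores.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
y_pred = (scores >= 0.5).astype(int)  # assumed 0.5 decision threshold

acc = accuracy_score(y_true, y_pred)          # 0.75
err = 1.0 - acc                               # error rate: 0.25
ap = average_precision_score(y_true, scores)  # ~0.833
auc = roc_auc_score(y_true, scores)           # 0.75

# Class-wise sensitivity from the confusion matrix:
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
real_sensitivity = tn / (tn + fp)  # 1.0: every real image kept
fake_sensitivity = tp / (tp + fn)  # 0.5: half the fakes are missed
```

Reporting both class-wise sensitivities, rather than accuracy alone, is what exposes detectors that score well overall while systematically missing fakes.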


Key Contributions

  • Unified benchmarking framework for reproducible, systematic evaluation of deepfake detection methods across standardized training paradigms (scratch, frozen, fine-tuned)
  • Empirical study of 10 SoTA forensic methods across 7 GAN and diffusion-based datasets using multiple metrics (accuracy, AP, ROC-AUC, error rate, class-wise sensitivity)
  • Explainability analysis via Grad-CAM heatmaps and confidence curves to interpret detector decisions and identify generalization failure modes
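The cross-model generalization failure the benchmark surfaces can be illustrated with a toy NumPy sketch: a detector thresholded on one generator family's artifact statistics transfers poorly to another family. The generator names, the 1-D "artifact strength" features, and the threshold are all hypothetical stand-ins for real detector behavior.

```python
import numpy as np

def accuracy(detector, X, y):
    """Fraction of samples the detector labels correctly."""
    return float(np.mean(detector(X) == y))

# Hypothetical 1-D artifact-strength features per generator family.
# Real images cluster near 0; fakes from each generator sit at different
# offsets, so a threshold tuned on one generator may not transfer.
test_sets = {
    "progan":   (np.array([0.1, 0.2, 0.9, 1.0]), np.array([0, 0, 1, 1])),
    "stylegan": (np.array([0.1, 0.2, 0.7, 0.8]), np.array([0, 0, 1, 1])),
    "sd":       (np.array([0.1, 0.2, 0.4, 0.5]), np.array([0, 0, 1, 1])),
}

# Detector "trained" on ProGAN: threshold between its real/fake clusters.
detector = lambda X: (X >= 0.55).astype(int)

# Transferability row: in-distribution vs. cross-model accuracy.
matrix = {name: accuracy(detector, X, y) for name, (X, y) in test_sets.items()}
# In-distribution ("progan") accuracy is perfect; the diffusion test set
# ("sd") degrades to chance because its fakes fall below the threshold.
```

A full benchmark repeats this for every (training source, test set) pair, producing the generalization matrix from which cross-model degradation is read off.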

🛡️ Threat Analysis

Output Integrity Attack

The paper's entire focus is evaluating forensic methods for detecting AI-generated images (deepfakes), which is output integrity and content authenticity verification — a core ML09 concern. The unified benchmarking framework, metrics, and findings all address the problem of distinguishing real from AI-generated content.


Details

Domains
vision, generative
Model Types
cnn, transformer, gan, diffusion
Threat Tags
inference_time
Datasets
7 publicly available GAN and diffusion deepfake datasets (specific names not listed in excerpt)
Applications
deepfake detection, multimedia forensics, misinformation detection