How well do open-sourced AI-generated image detection models perform out-of-the-box: A comprehensive benchmark study
Simiao Ren, Yuchen Zhou, Xingyu Shen, Kidus Zewde, Tommy Duong, George Huang, Hatsanai (Neo) Tiangratanakul, Tsang (Dennis) Ng, En Wei, Jiayu Xue
Published on arXiv
2602.07814
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
No universal detector exists: Spearman ρ between detector rankings ranges from 0.01 to 0.87 across dataset pairs, and modern commercial generators (Flux Dev, Firefly v4, Midjourney v7) reduce average detection accuracy to only 18–30%.
As AI-generated images proliferate across digital platforms, reliable detection methods have become critical for combating misinformation and maintaining content authenticity. While numerous deepfake detection methods have been proposed, existing benchmarks predominantly evaluate fine-tuned models, leaving a critical gap in understanding out-of-the-box performance, the most common deployment scenario for practitioners. We present the first comprehensive zero-shot evaluation of 16 state-of-the-art detection methods, spanning 23 pretrained detector variants (some detectors have multiple released versions), across 12 diverse datasets comprising 2.6 million image samples from 291 unique generators, including modern diffusion models. Our systematic analysis reveals striking findings: (1) no universal winner exists, with detector rankings exhibiting substantial instability (Spearman ρ: 0.01 to 0.87 across dataset pairs); (2) a 37-percentage-point performance gap separates the best detector (75.0% mean accuracy) from the worst (37.5%); (3) training data alignment critically impacts generalization, causing up to 20–60% performance variance within architecturally identical detector families; (4) modern commercial generators (Flux Dev, Firefly v4, Midjourney v7) defeat most detectors, which achieve only 18–30% average accuracy on their outputs; and (5) we identify three systematic failure patterns affecting cross-dataset generalization. Statistical analysis confirms significant performance differences between detectors (Friedman test: χ² = 121.01, p < 10⁻¹⁶, Kendall W = 0.524). Our findings challenge the "one-size-fits-all" detector paradigm and provide actionable deployment guidelines, demonstrating that practitioners must select detectors based on their specific threat landscape rather than relying on published benchmark performance.
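For concreteness, the statistics cited in the abstract (pairwise Spearman ρ of detector rankings, the Friedman test, and Kendall's W) can be computed from a detectors × datasets accuracy matrix along the lines below. This is a minimal sketch with a randomly generated placeholder matrix `acc`, not the authors' code or their reported numbers.

```python
# Minimal sketch (not the authors' code) of the ranking-stability and
# significance statistics described above. `acc` is a hypothetical
# (detectors x datasets) accuracy matrix standing in for real results.
import numpy as np
from itertools import combinations
from scipy.stats import spearmanr, friedmanchisquare

rng = np.random.default_rng(0)
acc = rng.uniform(0.3, 0.9, size=(23, 12))  # 23 detector variants, 12 datasets

# (1) Ranking instability: Spearman rho between detector rankings
#     for every pair of datasets.
rhos = []
for i, j in combinations(range(acc.shape[1]), 2):
    rho, _ = spearmanr(acc[:, i], acc[:, j])
    rhos.append(rho)
print(f"Spearman rho across dataset pairs: {min(rhos):.2f} to {max(rhos):.2f}")

# (2) Friedman test: each dataset is a block, each detector a treatment;
#     tests whether detector performance differs significantly overall.
chi2, p = friedmanchisquare(*[acc[k, :] for k in range(acc.shape[0])])

# (3) Kendall's W, the agreement of the datasets on how they rank the
#     detectors, derived from the Friedman statistic:
#     W = chi2 / (n_blocks * (n_treatments - 1)).
n_detectors, n_datasets = acc.shape
W = chi2 / (n_datasets * (n_detectors - 1))
print(f"Friedman chi2={chi2:.2f}, p={p:.2e}, Kendall W={W:.3f}")
```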
Key Contributions
- First comprehensive zero-shot evaluation of 16 SOTA AI-generated image detectors (23 variants) across 12 datasets spanning 2.6M samples and 291 unique generators (see the evaluation sketch after this list)
- Identifies three systematic failure patterns causing cross-dataset generalization failures, including up to 20–60% performance variance within architecturally identical detector families
- Provides actionable deployment guidelines showing practitioners must select detectors based on their specific threat landscape rather than published benchmark rankings
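To illustrate the out-of-the-box protocol behind these contributions, the sketch below evaluates a frozen, pretrained detector with no fine-tuning; `evaluate_out_of_the_box`, the detector callable, and the dataset iterators are hypothetical names for illustration, not an interface from the paper.

```python
# Minimal sketch of a zero-shot ("out-of-the-box") evaluation loop.
# The detector is treated as a frozen callable mapping an image to
# P(AI-generated); dataset loading is left to the practitioner.
from typing import Callable, Dict, Iterable, Tuple

def evaluate_out_of_the_box(
    detector: Callable[[object], float],               # image -> P(AI-generated)
    datasets: Dict[str, Iterable[Tuple[object, int]]],  # name -> (image, label) pairs
    threshold: float = 0.5,
) -> Dict[str, float]:
    """Accuracy of a frozen detector on each dataset, with no fine-tuning."""
    results = {}
    for name, samples in datasets.items():
        correct = total = 0
        for image, label in samples:   # label: 1 = AI-generated, 0 = real
            pred = int(detector(image) >= threshold)
            correct += int(pred == label)
            total += 1
        results[name] = correct / max(total, 1)
    return results

# Usage idea: run this for every pretrained detector, rank detectors per
# dataset, and compare rankings across datasets to expose instability.
```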
🛡️ Threat Analysis
Directly evaluates AI-generated image detection (deepfake detection) — the canonical ML09 use case of verifying content authenticity and detecting synthetic media outputs.