Bridging the Gap Between Ideal and Real-world Evaluation: Benchmarking AI-Generated Image Detection in Challenging Scenarios
Chunxiao Li, Xiaoxiao Wang, Meiling Li, Boming Miao, Peng Sun, Yunjian Zhang, Xiangyang Ji, Yao Zhu
Published on arXiv (arXiv:2509.09172)
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Current AI-generated image detectors degrade significantly under real-world social media transmission and re-digitization, while humans given few-shot examples adapt to these conditions better than automated detectors do.
RRDataset
Novel benchmark dataset introduced
With the rapid advancement of generative models, highly realistic image synthesis has posed new challenges to digital security and media credibility. Although AI-generated image detection methods have partially addressed these concerns, a substantial research gap remains in evaluating their performance under complex real-world conditions. This paper introduces the Real-World Robustness Dataset (RRDataset) for comprehensive evaluation of detection models across three dimensions: 1) Scenario Generalization: RRDataset encompasses high-quality images from seven major scenarios (War and Conflict, Disasters and Accidents, Political and Social Events, Medical and Public Health, Culture and Religion, Labor and Production, and everyday life), addressing existing dataset gaps from a content perspective. 2) Internet Transmission Robustness: examining detector performance on images that have undergone multiple rounds of sharing across various social media platforms. 3) Re-digitization Robustness: assessing model effectiveness on images altered through four distinct re-digitization methods. We benchmarked 17 detectors and 10 vision-language models (VLMs) on RRDataset and conducted a large-scale human study involving 192 participants to investigate human few-shot learning capabilities in detecting AI-generated images. The benchmarking results reveal the limitations of current AI detection methods under real-world conditions and underscore the importance of drawing on human adaptability to develop more robust detection algorithms.
Key Contributions
- RRDataset: a new benchmark covering 7 real-world content scenarios (war, disasters, political events, medical, culture, labor, everyday life) to evaluate AI-generated image detectors
- Evaluation of 17 detectors and 10 VLMs under two real-world degradation conditions: internet transmission across social media platforms and four re-digitization methods
- Large-scale human study (192 participants) revealing human few-shot adaptability as a signal for developing more robust detection algorithms
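The benchmark's core measurement is how much detector accuracy drops between clean images and their degraded counterparts (after transmission or re-digitization). A minimal sketch of that evaluation loop is below; `toy_detector`, the `artifact_score` feature, and the sample sets are hypothetical illustrations, not artifacts from the paper.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# A sample pairs an image representation with its ground-truth label
# (True = AI-generated, False = real).
Sample = Tuple[dict, bool]

def accuracy(detector: Callable[[dict], bool], samples: Iterable[Sample]) -> float:
    """Fraction of samples the detector labels correctly."""
    samples = list(samples)
    correct = sum(detector(img) == label for img, label in samples)
    return correct / len(samples)

def robustness_report(detector: Callable[[dict], bool],
                      clean: List[Sample],
                      transmitted: List[Sample],
                      redigitized: List[Sample]) -> Dict[str, float]:
    """Compare accuracy on clean images against the two degraded
    conditions RRDataset evaluates, and report the worst-case drop."""
    base = accuracy(detector, clean)
    trans = accuracy(detector, transmitted)
    redig = accuracy(detector, redigitized)
    return {
        "clean": base,
        "transmitted": trans,
        "re_digitized": redig,
        "worst_drop": base - min(trans, redig),
    }

# Hypothetical detector: flags an image as AI-generated when a
# synthetic-artifact score exceeds a threshold. Degradation (compression,
# rescreening) tends to erode exactly such artifacts.
toy_detector = lambda img: img["artifact_score"] > 0.5

clean       = [({"artifact_score": 0.9}, True), ({"artifact_score": 0.1}, False)]
transmitted = [({"artifact_score": 0.4}, True), ({"artifact_score": 0.1}, False)]
redigitized = [({"artifact_score": 0.3}, True), ({"artifact_score": 0.2}, False)]

report = robustness_report(toy_detector, clean, transmitted, redigitized)
# The toy detector is perfect on clean images but loses accuracy once
# degradation pushes artifact scores below its fixed threshold.
```

The same harness shape scales to the paper's setting of 17 detectors and 10 VLMs: swap in real model inference for `toy_detector` and real image sets for the toy samples.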
🛡️ Threat Analysis
The paper's primary contribution is evaluating AI-generated image detectors, which directly addresses output integrity and content authenticity. RRDataset tests whether detectors can still distinguish synthetic from real images under real-world distribution shifts (social media sharing, re-digitization), making it a core benchmark contribution for ML09 (Output Integrity Attack).