Synthetic Image Detection with CLIP: Understanding and Assessing Predictive Cues
Marco Willi, Melanie Mathys, Michael Graber
Published on arXiv
2602.12381
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
CLIP-based detectors achieve 0.96 mAP on GAN benchmarks, but generalization across generator families drops to 0.37 mAP, a gap driven by reliance on high-level photographic style cues rather than generator-specific artifacts.
SynthCLIC
Novel technique introduced
Recent generative models produce near-photorealistic images, challenging the trustworthiness of photographs. Synthetic image detection (SID) has thus become an important area of research. Prior work has highlighted how synthetic images differ from real photographs; unfortunately, SID methods often struggle to generalize to novel generative models and perform poorly in practical settings. CLIP, a foundational vision-language model that yields semantically rich image-text embeddings, shows strong accuracy and generalization for SID. Yet the relevant cues embedded in CLIP features remain unknown: it is unclear whether CLIP-based detectors simply detect strong visual artifacts or exploit subtle semantic biases, either of which would render them unreliable in practical settings or against high-quality generative models. We introduce SynthCLIC, a paired dataset of real photographs and high-quality synthetic counterparts from recent diffusion models, designed to reduce semantic bias in SID. Using an interpretable linear head with de-correlated activations and a text-grounded concept model, we analyze what CLIP-based detectors learn. CLIP-based linear detectors reach 0.96 mAP on a GAN-based benchmark but only 0.92 on our high-quality diffusion dataset SynthCLIC, and generalization across generator families drops to as low as 0.37 mAP. We find that the detectors primarily rely on high-level photographic attributes (e.g., minimalist style, lens flare, or depth layering) rather than overt generator-specific artifacts. CLIP-based detectors perform well overall but generalize unevenly across diverse generative architectures. This highlights the need for continual model updates and broader training exposure, while reinforcing CLIP-based approaches as a strong foundation for more universal, robust SID.
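The detection setup the paper analyzes is a linear head trained on frozen CLIP embeddings, evaluated with average precision. A minimal sketch of that probing pipeline is below; note that random feature vectors stand in for real CLIP embeddings here, and the helper `make_split`, the separation direction, and all sizes are illustrative assumptions, not details from the paper.

```python
# Sketch of a linear detection head over frozen image embeddings, assuming
# embeddings are precomputed. Random vectors stand in for CLIP features;
# `make_split` and the separation direction are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
EMB_DIM = 512  # e.g., the ViT-B/32 CLIP embedding size

def make_split(n, shift):
    """Toy stand-in for paired (real, synthetic) embeddings: synthetic
    embeddings are offset along a fixed direction in feature space."""
    real = rng.normal(size=(n, EMB_DIM))
    fake = rng.normal(size=(n, EMB_DIM)) + shift
    X = np.vstack([real, fake])
    y = np.concatenate([np.zeros(n), np.ones(n)])  # 1 = synthetic
    return X, y

direction = np.zeros(EMB_DIM)
direction[0] = 2.5  # synthetic images shifted along one feature axis

X_train, y_train = make_split(500, direction)
X_test, y_test = make_split(200, direction)

# Linear probe on frozen embeddings: the kind of head the paper inspects.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]

# Average precision per test set; mAP averages this across benchmarks.
ap = average_precision_score(y_test, scores)
print(f"average precision: {ap:.3f}")
```

Because the probe is linear and the backbone frozen, the detector can only exploit directions already present in the embedding space, which is what makes the paper's question about *which* cues those directions encode well-posed.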
Key Contributions
- SynthCLIC: a paired dataset of real photographs and high-quality diffusion-generated synthetic counterparts designed to reduce semantic bias in synthetic image detection
- Interpretability analysis showing CLIP-based detectors rely on high-level photographic attributes (minimalist style, lens flare, depth layering) rather than low-level generator artifacts
- Quantitative evaluation showing CLIP-based detectors generalize unevenly: 0.96 mAP on GAN benchmarks but as low as 0.37 mAP across generator families
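The text-grounded concept analysis attributes detector behavior to nameable photographic attributes. The core mechanism can be sketched as ranking concepts by cosine similarity between an image embedding and embedded concept prompts. Again, random unit vectors stand in for real CLIP image/text encoder outputs, and the concept list merely echoes attributes the paper highlights; the `unit` helper and the simulated image are illustrative assumptions.

```python
# Minimal sketch of text-grounded concept scoring: rank candidate concepts
# by cosine similarity between an image embedding and embedded text prompts.
# Random unit vectors stand in for real CLIP image/text embeddings.
import numpy as np

rng = np.random.default_rng(1)
EMB_DIM = 512

def unit(v):
    """Normalize vector(s) to unit length so dot products are cosines."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Concepts echoing photographic attributes the paper highlights.
concepts = ["minimalist style", "lens flare", "depth layering", "film grain"]
text_emb = unit(rng.normal(size=(len(concepts), EMB_DIM)))  # stand-in text encoder

# Simulate an image whose embedding leans strongly toward "lens flare".
image_emb = unit(text_emb[1] + 0.3 * unit(rng.normal(size=EMB_DIM)))

sims = text_emb @ image_emb  # cosine similarities (all vectors unit-norm)
ranked = sorted(zip(concepts, sims), key=lambda t: -t[1])
for name, s in ranked:
    print(f"{name:18s} {s:+.3f}")
```

In the real setting, a concept's similarity scores would be correlated with the detector's output across a dataset to test whether high-level attributes like these, rather than generator artifacts, drive the real/synthetic decision.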
🛡️ Threat Analysis
The paper's core contribution is understanding and evaluating AI-generated (synthetic) image detection systems, a direct application of output integrity and content authenticity verification. It introduces the SynthCLIC dataset and analyzes the cues detectors use to distinguish real from AI-generated images.