Fixed-Threshold Evaluation of a Hybrid CNN-ViT for AI-Generated Image Detection Across Photos and Art
Md Ashik Khan 1, Arafat Alam Jion 2
Published on arXiv: 2512.21512
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Fixed-threshold evaluation reveals that frequency-aided CNNs collapse under JPEG Q60 compression (93.33% → 61.49%) while ViTs remain robust (92.86% → 88.36%). The hybrid achieves 91.4% accuracy on photos, 89.7% on art, and 98.3% on CIFAKE, with an AUROC of 0.9977.
Novel technique introduced
Fixed-Threshold Evaluation Protocol with Gated CNN-ViT Fusion
AI image generators create both photorealistic images and stylized art, necessitating robust detectors that maintain performance under common post-processing transformations (JPEG compression, blur, downscaling). Existing methods optimize single metrics without addressing deployment-critical factors such as operating point selection and fixed-threshold robustness. This work addresses misleading robustness estimates by introducing a fixed-threshold evaluation protocol: decision thresholds are selected once on clean validation data and then held fixed across all post-processing transformations. Traditional methods retune thresholds per condition, which artificially inflates robustness estimates and masks deployment failures. We report deployment-relevant performance at three operating points (Low-FPR, ROC-optimal, Best-F1) under systematic degradation testing, using a lightweight CNN-ViT hybrid with gated fusion and optional frequency enhancement. Our evaluation exposes a statistically validated forensic-semantic spectrum: frequency-aided CNNs excel on pristine photos but collapse under compression (93.33% to 61.49%), whereas ViTs degrade minimally (92.86% to 88.36%) through robust semantic pattern recognition. Multi-seed experiments demonstrate that all architectures achieve about 0.15 higher AUROC (roughly 15 percentage points) on artistic content (0.901-0.907) than on photorealistic images (0.747-0.759), confirming that semantic patterns provide fundamentally more reliable detection cues than forensic artifacts. Our hybrid approach achieves balanced cross-domain performance: 91.4% accuracy on tiny-genimage photos, 89.7% on AiArtData art/graphics, and a competitive 98.3% on CIFAKE. Fixed-threshold evaluation eliminates retuning inflation, reveals genuine robustness gaps, and yields actionable deployment guidance: prefer CNNs for clean photo verification, ViTs for compressed content, and hybrids for art/graphics screening.
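To make the three operating points concrete, the following is a minimal sketch of how thresholds might be chosen once on clean validation scores. It is not the authors' implementation: the function names, the `max_fpr` budget, the use of Youden's J as the "ROC-optimal" criterion, and the toy scores are all assumptions made for illustration.

```python
from typing import Dict, Sequence, Tuple


def confusion_at(scores: Sequence[float], labels: Sequence[int],
                 thr: float) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn), predicting 'AI-generated' (1) when score >= thr."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        pred = 1 if s >= thr else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def select_operating_points(scores: Sequence[float], labels: Sequence[int],
                            max_fpr: float = 0.01) -> Dict[str, float]:
    """Pick three thresholds on CLEAN validation data (hypothetical helper).

    low_fpr     : highest TPR subject to FPR <= max_fpr
    roc_optimal : maximizes Youden's J = TPR - FPR (an assumed criterion)
    best_f1     : maximizes F1 on the positive (AI-generated) class
    """
    # Start from "flag nothing" so a valid Low-FPR threshold always exists.
    chosen = {"low_fpr": float("inf"), "roc_optimal": 0.0, "best_f1": 0.0}
    best_tpr, best_j, best_f1 = 0.0, -2.0, -1.0
    for thr in sorted(set(scores)):
        tp, fp, tn, fn = confusion_at(scores, labels, thr)
        tpr = tp / (tp + fn) if (tp + fn) else 0.0
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if fpr <= max_fpr and tpr > best_tpr:
            best_tpr, chosen["low_fpr"] = tpr, thr
        if tpr - fpr > best_j:
            best_j, chosen["roc_optimal"] = tpr - fpr, thr
        if f1 > best_f1:
            best_f1, chosen["best_f1"] = f1, thr
    return chosen


# Toy clean-validation scores (synthetic, not data from the paper).
val_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
val_labels = [0, 0, 0, 1, 0, 1, 1, 1]
thresholds = select_operating_points(val_scores, val_labels, max_fpr=0.01)
```

Under the protocol described above, the returned thresholds would then be reused unchanged when scoring JPEG-compressed, blurred, or downscaled copies of the test set.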
Key Contributions
- Fixed-threshold evaluation protocol that holds decision thresholds constant across post-processing distortions, eliminating the artificial robustness inflation caused by per-condition retuning
- Statistically validated forensic-semantic spectrum: frequency-aided CNNs collapse under JPEG compression (93.33% → 61.49%) while ViTs maintain robustness (92.86% → 88.36%), guiding deployment-stage architecture selection
- Cross-domain finding that all architectures achieve about 0.15 higher AUROC (roughly 15 percentage points) on artistic content (0.901–0.907) than on photorealistic images (0.747–0.759), confirming semantic cues are more reliable detection signals than forensic artifacts
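The retuning inflation named in the first contribution can be illustrated with a small synthetic example. This is a sketch under stated assumptions, not the paper's code: the helper names are hypothetical, accuracy is used as the tuning criterion for simplicity, and the "degraded" scores simply imitate a detector whose outputs for AI-generated images drift downward after compression.

```python
from typing import Dict, Sequence


def accuracy(scores: Sequence[float], labels: Sequence[int], thr: float) -> float:
    """Fraction of samples classified correctly when predicting 1 for score >= thr."""
    correct = sum((s >= thr) == bool(y) for s, y in zip(scores, labels))
    return correct / len(labels)


def best_accuracy_threshold(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Threshold (taken from the observed scores) with the highest accuracy."""
    return max(sorted(set(scores)), key=lambda t: accuracy(scores, labels, t))


def compare_fixed_vs_retuned(clean: Sequence[float], degraded: Sequence[float],
                             labels: Sequence[int]) -> Dict[str, float]:
    """Contrast the two protocols on the same degraded scores.

    fixed   : threshold chosen on clean scores, then frozen
    retuned : threshold re-chosen on the degraded scores themselves,
              which uses information unavailable at deployment time
    """
    fixed_thr = best_accuracy_threshold(clean, labels)
    retuned_thr = best_accuracy_threshold(degraded, labels)
    return {
        "fixed": accuracy(degraded, labels, fixed_thr),
        "retuned": accuracy(degraded, labels, retuned_thr),
    }


# Synthetic scores (assumed for illustration; not results from the paper).
labels = [0, 0, 0, 1, 1, 1]
clean_scores = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
degraded_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]  # fake-image scores shifted down

result = compare_fixed_vs_retuned(clean_scores, degraded_scores, labels)
print(result)
```

In this toy case the retuned threshold hides the failure entirely, while the frozen threshold exposes it, which is the kind of gap the fixed-threshold protocol is meant to surface.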
🛡️ Threat Analysis
This work directly advances AI-generated image detection (a core ML09 concern) by proposing a novel fixed-threshold evaluation methodology and a CNN-ViT hybrid detector evaluated on photorealistic and artistic content. It contributes both a detection architecture and a more rigorous robustness measurement framework for synthetic content identification.