Defense · 2026

Your AI-Generated Image Detector Can Secretly Achieve SOTA Accuracy, If Calibrated

Muli Yang 1, Gabriel James Goenawan 1, Henan Wang 2, Huaiyuan Qin 1, Chenghao Xu 3, Yanhua Yang 3, Fen Fang 1, Ying Sun 1, Joo-Hwee Lim 1, Hongyuan Zhu 1

0 citations · 73 references · arXiv (Cornell University)


Published on arXiv (arXiv:2602.01973)

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Post-hoc logit calibration significantly improves cross-generator detection robustness without retraining, outperforming uncalibrated baselines on challenging open-world benchmarks.

AIGI-Det-Calib

Novel technique introduced


Despite being trained on balanced datasets, existing AI-generated image detectors often exhibit systematic bias at test time, frequently misclassifying fake images as real. We hypothesize that this behavior stems from distributional shift in fake samples and implicit priors learned during training. Specifically, models tend to overfit to superficial artifacts that do not generalize well across different generation methods, leading to a misaligned decision threshold when faced with test-time distribution shift. To address this, we propose a theoretically grounded post-hoc calibration framework based on Bayesian decision theory. In particular, we introduce a learnable scalar correction to the model's logits, optimized on a small validation set from the target distribution while keeping the backbone frozen. This parametric adjustment compensates for distributional shift in model output, realigning the decision boundary even without requiring ground-truth labels. Experiments on challenging benchmarks show that our approach significantly improves robustness without retraining, offering a lightweight and principled solution for reliable and adaptive AI-generated image detection in the open world. Code is available at https://github.com/muliyangm/AIGI-Det-Calib.
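The abstract describes the method as a learnable scalar correction applied to a frozen detector's logits, fit on a small unlabeled validation set from the target distribution. The paper's exact calibration objective is not reproduced here, so the following is only a minimal sketch under stated assumptions: it grid-searches a scalar shift so that the detector's mean predicted fake-probability on unlabeled validation logits matches an assumed class prior. The function name, the grid search, and the prior-matching objective are illustrative choices, not the authors' implementation.

```python
import numpy as np

def calibrate_logit_shift(val_logits, prior_fake=0.5,
                          grid=np.linspace(-5.0, 5.0, 1001)):
    """Pick a scalar shift b for a frozen binary detector's logits.

    Hypothetical prior-matching objective: choose b so that the mean
    sigmoid(logit + b) over UNLABELED validation logits matches an
    assumed fraction of fake images (prior_fake). No ground-truth
    labels are used, mirroring the label-free setting in the abstract.
    """
    best_b, best_gap = 0.0, float("inf")
    for b in grid:
        p_fake = 1.0 / (1.0 + np.exp(-(val_logits + b)))  # sigmoid
        gap = abs(p_fake.mean() - prior_fake)
        if gap < best_gap:
            best_b, best_gap = b, gap
    return best_b

# Toy usage: logits skewed negative, i.e. a detector that tends to
# call fake images "real" (the systematic bias the paper identifies).
biased_logits = np.array([-2.0, -1.0, 0.0])
b = calibrate_logit_shift(biased_logits)  # shifts the boundary back
```

At test time the frozen backbone is untouched; only `logit + b` is thresholded, which is what makes the adjustment lightweight and deployable without retraining.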


Key Contributions

  • Identification of systematic threshold misalignment in AI-generated image detectors caused by distributional shift in fake samples and implicit training priors
  • Theoretically grounded post-hoc calibration framework using Bayesian decision theory with a learnable scalar logit correction optimized on a small unlabeled validation set
  • Demonstrated significant robustness improvements on challenging benchmarks without backbone retraining, enabling lightweight deployment adaptation

🛡️ Threat Analysis

Output Integrity Attack

Directly addresses AI-generated image detection (synthetic content authenticity), proposing a novel calibration method to improve detectors' output integrity under test-time distributional shift — squarely within the content provenance and AI-generated content detection sub-domain of ML09.


Details

Domains
vision · generative
Model Types
cnn · transformer · diffusion · gan
Threat Tags
inference_time · black_box
Applications
ai-generated image detection · deepfake detection · synthetic image forensics