DINOv3 Beats Specialized Detectors: A Simple Foundation Model Baseline for Image Forensics
Jieming Yu 1, Qiuxiao Feng 1, Zhuohan Wang 2, Xiaochen Ma 1
Published on arXiv
2604.16083
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Achieves a 17.0-point average pixel-level F1 improvement over the previous SOTA on four standard benchmarks under the CAT-Net protocol; under the data-scarce MVSS-Net protocol, LoRA reaches an average F1 of 0.774 vs 0.530 for the strongest prior method
DINOv3-LoRA-IML
Novel technique introduced
With the rapid advancement of deep generative models, realistic fake images have become increasingly accessible, yet existing localization methods rely on complex designs and still struggle to generalize across manipulation types and imaging conditions. We present a simple but strong baseline based on DINOv3 with LoRA adaptation and a lightweight convolutional decoder. Under the CAT-Net protocol, our best model improves average pixel-level F1 by 17.0 points over the previous state of the art on four standard benchmarks using only 9.1M trainable parameters on top of a frozen ViT-L backbone, and even our smallest variant surpasses all prior specialized methods. LoRA consistently outperforms full fine-tuning across all backbone scales. Under the data-scarce MVSS-Net protocol, LoRA reaches an average F1 of 0.774 versus 0.530 for the strongest prior method, while full fine-tuning becomes highly unstable, suggesting that pre-trained representations encode forensic information that is better preserved than overwritten. The baseline also exhibits strong robustness to Gaussian noise, JPEG re-compression, and Gaussian blur. We hope this work can serve as a reliable baseline for the research community and a practical starting point for future image-forensic applications. Code is available at https://github.com/Irennnne/DINOv3-IML.
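The LoRA adaptation mentioned in the abstract can be illustrated with a minimal plain-Python sketch. It assumes the standard LoRA formulation, in which a frozen weight matrix W is augmented by a trainable low-rank product B·A scaled by alpha/rank. The class name `LoRALinear`, the helper `lora_param_count`, the rank, the scaling, and the initialization are all illustrative assumptions and are not taken from the authors' released code; which DINOv3 layers are adapted, and at what rank, is not stated here.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters added by one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


class LoRALinear:
    """Hypothetical illustration: a frozen linear map plus a trainable low-rank update."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight            # frozen pre-trained weight, d_out x d_in
        self.scale = alpha / rank       # standard LoRA scaling (assumed here)
        # A gets a small random init; B starts at zero so the adapted layer
        # initially reproduces the frozen layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)                  # frozen path: W x
        update = matvec(self.B, matvec(self.A, x))     # low-rank path: B (A x)
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        # Only A and B would be updated during training; W stays frozen.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])
```

Because B starts at zero, the adapted layer begins as an exact copy of the pre-trained one, and training only moves it within a low-rank subspace. This is one way to read the abstract's suggestion that pre-trained representations are better preserved than overwritten.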
Key Contributions
- DINOv3 + LoRA baseline outperforms specialized forensic detectors by 17.0 F1 points with only 9.1M trainable parameters
- LoRA adaptation consistently beats full fine-tuning across all backbone scales and remains stable under the data-scarce MVSS-Net protocol, where full fine-tuning becomes highly unstable
- Strong robustness to common image degradations (JPEG compression, Gaussian noise, blur)
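The pixel-level F1 metric behind the figures above can be sketched as follows. This is a minimal illustration that assumes a fixed binarization threshold of 0.5 and a score of 0.0 when neither the prediction nor the ground truth contains positive pixels. The function name `pixel_f1` is hypothetical, and the paper's exact thresholding and averaging conventions are not stated here.

```python
def pixel_f1(pred, target, threshold=0.5):
    """Pixel-level F1 between a predicted score map and a binary ground-truth mask.

    pred:   2D list of manipulation scores in [0, 1]
    target: 2D list of 0/1 labels (1 = manipulated pixel)
    """
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for score, label in zip(pred_row, target_row):
            predicted_positive = score >= threshold
            actual_positive = label >= 0.5
            if predicted_positive and actual_positive:
                tp += 1
            elif predicted_positive and not actual_positive:
                fp += 1
            elif actual_positive:
                fn += 1
    denominator = 2 * tp + fp + fn
    # Assumed convention: return 0.0 when there are no positives at all.
    return 2 * tp / denominator if denominator else 0.0
```

For example, a 2x2 prediction `[[0.9, 0.2], [0.7, 0.1]]` against the mask `[[1, 0], [0, 0]]` yields one true positive, one false positive, and no false negatives, so F1 = 2/3. A 17.0-point improvement refers to this quantity expressed on a 0 to 100 scale.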
🛡️ Threat Analysis
Detects and localizes manipulated regions in fake and AI-generated images, a core output-integrity and content-authenticity problem. As deepfake and manipulation detection, it falls under ML09.