
DinoLizer: Learning from the Best for Generative Inpainting Localization

Minh Thong Doi 1,2, Jan Butora 2, Vincent Itier 1,2, Jérémie Boulanger 2, Patrick Bas 2


Published on arXiv: 2511.20722

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

DinoLizer achieves on average 12% higher IoU than the next best localization model across diverse generative inpainting benchmarks, with even larger gains under post-processing degradation.

DinoLizer

Novel technique introduced


We introduce DinoLizer, a DINOv2-based model for localizing manipulated regions in generative inpainting. Our method builds on a DINOv2 model pretrained to detect synthetic images on the B-Free dataset. We add a linear classification head on top of the Vision Transformer's patch embeddings to predict manipulations at a 14×14 patch resolution. The head is trained to focus on semantically altered regions, treating non-semantic edits as part of the original content. Because the ViT accepts only fixed-size inputs, we use a sliding-window strategy to aggregate predictions over larger images; the resulting heatmaps are post-processed to refine the estimated binary manipulation masks. Empirical results show that DinoLizer surpasses state-of-the-art local manipulation detectors on a range of inpainting datasets derived from different generative models. It remains robust to common post-processing operations such as resizing, noise addition, and JPEG (double) compression. On average, DinoLizer achieves a 12% higher Intersection-over-Union (IoU) than the next best model, with even greater gains after post-processing. Our experiments with off-the-shelf DINOv2 demonstrate the strong representational power of Vision Transformers for this task. Finally, extensive ablation studies comparing DINOv2 and its successor, DINOv3, in deepfake localization confirm DinoLizer's superiority. The code will be publicly available upon acceptance of the paper.
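The abstract's core classification step — a linear head mapping each ViT patch embedding to a manipulation probability — can be sketched as below. This is an illustrative toy, not the authors' code: the grid size, embedding dimension, and the `patch_head` function are assumptions (a real DINOv2 backbone would supply the embeddings, and the head weights would come from training on B-Free).

```python
import numpy as np

def patch_head(embeddings, weight, bias=0.0):
    """Linear head + sigmoid: per-patch manipulation probability.

    embeddings: (H, W, D) array of ViT patch embeddings
    weight:     (D,) trained head weights (random here for illustration)
    Returns an (H, W) probability map, one score per patch.
    """
    logits = embeddings @ weight + bias          # (H, W, D) @ (D,) -> (H, W)
    return 1.0 / (1.0 + np.exp(-logits))         # sigmoid

# Toy demo: a 14x14 patch grid with 64-dim embeddings (dims are assumptions).
rng = np.random.default_rng(0)
emb = rng.normal(size=(14, 14, 64))
w = rng.normal(size=64) * 0.1                    # stand-in for trained weights
probs = patch_head(emb, w)
```

Thresholding `probs` per patch would already give a coarse 14×14 manipulation mask; the paper refines this further via sliding-window aggregation and post-processing.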


Key Contributions

  • DINOv2-based image forgery localization model with a linear classification head over ViT patch embeddings, trained to distinguish semantically altered vs. original regions at 14×14 patch resolution
  • Sliding-window inference strategy to handle variable-size images, with heatmap fusion and post-processing to produce binary manipulation masks
  • Extensive ablation comparing DINOv2 and DINOv3 backbones; achieves 12% higher average IoU than prior state-of-the-art across multiple generative inpainting datasets and remains robust to JPEG compression, resizing, and noise
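The sliding-window inference from the second contribution can be sketched as follows: overlapping fixed-size windows are each scored, their heatmaps averaged into one full-image map, and the map thresholded into a binary mask. Window size, stride, threshold, and averaging as the fusion rule are assumptions for illustration, not the published settings.

```python
import numpy as np

def _starts(size, win, stride):
    """Window start offsets; the last window sits flush with the border."""
    last = max(size - win, 0)
    starts = list(range(0, last + 1, stride))
    if starts[-1] != last:
        starts.append(last)
    return starts

def sliding_window_mask(image, predict_fn, win=224, stride=112, thr=0.5):
    """Fuse per-window heatmaps by averaging, then binarize.

    predict_fn: maps a (h, w, ...) window to an (h, w) heatmap in [0, 1]
    Returns (heatmap, binary_mask) at full image resolution.
    """
    h, w = image.shape[:2]
    acc = np.zeros((h, w))                        # summed window scores
    cnt = np.zeros((h, w))                        # how many windows cover each pixel
    for y in _starts(h, win, stride):
        for x in _starts(w, win, stride):
            window = image[y:y + win, x:x + win]
            acc[y:y + win, x:x + win] += predict_fn(window)
            cnt[y:y + win, x:x + win] += 1.0
    heat = acc / cnt                              # average over overlaps
    return heat, heat >= thr

# Demo with a stand-in predictor that scores every pixel 0.8.
img = np.zeros((300, 400))
heat, mask = sliding_window_mask(img, lambda p: np.full(p.shape[:2], 0.8))
```

Averaging overlaps is one plausible fusion choice; max-pooling the window heatmaps would favor recall over precision at window seams.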

🛡️ Threat Analysis

Output Integrity Attack

DinoLizer is a detector of deepfake/AI-generated content that localizes regions manipulated by generative inpainting models. It directly addresses output integrity and AI-generated content detection as defined in OWASP ML09.


Details

Domains
vision
Model Types
transformer
Threat Tags
inference_time
Datasets
BtB, B-Free, COCOG (CocoGlide), TGIF, SAGI-SP, SAGI-FR
Applications
image forgery localization, generative inpainting detection, deepfake localization