
Pixels Don't Lie (But Your Detector Might): Bootstrapping MLLM-as-a-Judge for Trustworthy Deepfake Detection and Reasoning Supervision

Kartik Kuckreja 1, Parul Gupta 2, Muhammad Haris Khan 1, Abhinav Dhall 2

0 citations · 52 references · arXiv (Cornell University)


Published on arXiv

2602.19715

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Reasoning-bootstrapped judge achieves 96.2% accuracy and 98.9% pairwise human agreement on meta-evaluation, outperforming baselines 30x its size; users preferred its rationales 70% of the time for faithfulness and groundedness.

DeepfakeJudge

Novel technique introduced


Deepfake detection models often generate natural-language explanations, yet their reasoning is frequently ungrounded in visual evidence, limiting reliability. Existing evaluations measure classification accuracy but overlook reasoning fidelity. We propose DeepfakeJudge, a framework for scalable reasoning supervision and evaluation that integrates an out-of-distribution benchmark containing recent generative and editing forgeries, a human-annotated subset with visual reasoning labels, and a suite of evaluation models that specialize in assessing reasoning rationales without requiring explicit ground-truth rationales. The Judge is optimized through a bootstrapped generator-evaluator process that scales human feedback into structured reasoning supervision and supports both pointwise and pairwise evaluation. On the proposed meta-evaluation benchmark, our reasoning-bootstrapped model achieves an accuracy of 96.2\%, outperforming baselines \texttt{30x} its size. The reasoning judge attains very high correlation with human ratings and 98.9\% pairwise agreement on the human-annotated meta-evaluation subset. These results establish reasoning fidelity as a quantifiable dimension of deepfake detection and demonstrate scalable supervision for interpretable deepfake reasoning. Our user study shows that participants preferred the rationales generated by our framework 70\% of the time, in terms of faithfulness, groundedness, and usefulness, compared to those produced by other models and datasets. All of our datasets, models, and codebase are \href{https://github.com/KjAeRsTuIsK/DeepfakeJudge}{open-sourced}.
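The bootstrapped generator-evaluator process described above can be sketched roughly as follows. This is a minimal, hypothetical illustration: the function names (`generate_rationale`, `judge_score`, `bootstrap_round`) and the acceptance threshold are invented here and do not reflect the authors' actual implementation.

```python
# Hypothetical sketch of a bootstrapped generator-evaluator loop: the
# generator proposes rationales, the judge scores them, and only rationales
# the judge accepts become reasoning supervision for the next round.
# All names and logic here are illustrative assumptions.

def generate_rationale(image_id: str) -> str:
    # Stand-in for the MLLM generator producing a visual-evidence rationale.
    return f"{image_id}: inconsistent lighting and blending near the jawline"

def judge_score(rationale: str) -> float:
    # Stand-in for the pointwise reasoning judge; a real judge would be the
    # reasoning-bootstrapped evaluation model, not a keyword check.
    return 0.9 if "inconsistent" in rationale else 0.2

def bootstrap_round(image_ids, threshold=0.8):
    """Keep rationales the judge scores at or above the threshold; these
    form structured reasoning supervision for the next training round."""
    accepted = []
    for image_id in image_ids:
        rationale = generate_rationale(image_id)
        if judge_score(rationale) >= threshold:
            accepted.append((image_id, rationale))
    return accepted

supervision = bootstrap_round(["img_001", "img_002"])
print(len(supervision))
```

In this toy loop both rationales pass the toy judge; in the paper's setting, the accepted rationales scale a small amount of human feedback into a much larger supervision set.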


Key Contributions

  • DeepfakeJudge: a bootstrapped generator-evaluator framework that scales human feedback into structured reasoning supervision for deepfake detection models
  • An out-of-distribution benchmark with recent generative and editing forgeries plus a human-annotated subset with visual reasoning labels for meta-evaluation
  • Pointwise and pairwise reasoning evaluation models that achieve 96.2% accuracy and 98.9% pairwise agreement with human raters without requiring explicit ground-truth rationales
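The pairwise-agreement figure cited above can be made concrete with a small sketch: for each pair of candidate rationales, the judge and a human each pick a winner, and agreement is the fraction of pairs where they pick the same one. The data and helper name below are invented for illustration and do not reproduce the paper's evaluation.

```python
# Toy illustration of the pairwise-agreement metric (the paper reports
# 98.9% agreement between the judge and human raters). Preferences and
# the helper name are illustrative assumptions.

def pairwise_agreement(judge_prefs, human_prefs):
    """Fraction of rationale pairs where the judge picks the same winner
    ('A' or 'B') as the human annotator."""
    assert len(judge_prefs) == len(human_prefs)
    matches = sum(j == h for j, h in zip(judge_prefs, human_prefs))
    return matches / len(judge_prefs)

judge = ["A", "B", "A", "A", "B"]
human = ["A", "B", "A", "B", "B"]
print(pairwise_agreement(judge, human))  # 0.8 on this toy data
```

Pointwise evaluation, by contrast, scores each rationale in isolation; the paper reports correlation with human ratings for that setting rather than agreement over pairs.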

🛡️ Threat Analysis

Output Integrity Attack

Directly addresses AI-generated content detection (deepfake images/videos). DeepfakeJudge is a framework for evaluating and supervising the reasoning quality of deepfake detectors — output integrity and content authenticity are the core threats being addressed.


Details

Domains
vision, multimodal
Model Types
vlm, diffusion
Threat Tags
inference_time
Datasets
custom out-of-distribution deepfake benchmark, human-annotated meta-evaluation subset
Applications
deepfake detection, content authenticity verification, ai-generated image detection