
SynthForensics: A Multi-Generator Benchmark for Detecting Synthetic Video Deepfakes

Roberto Leotta 1, Salvatore Alfio Sambataro 2, Claudio Vittorio Ragaglia 2, Mirko Casu 2, Yuri Petralia 1, Francesco Guarnera 2, Luca Guarnera 2, Sebastiano Battiato 2

0 citations · 57 references · arXiv (Cornell University)


Published on arXiv

2602.04939

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

State-of-the-art deepfake detectors show a mean AUC drop of 29.19% on T2V synthetic videos, with some performing worse than random chance; training on SynthForensics recovers generalization to 93.81% AUC on unseen generators.

SynthForensics

Novel technique introduced


The landscape of synthetic media has been irrevocably altered by text-to-video (T2V) models, whose outputs are rapidly approaching indistinguishability from reality. Critically, this technology is no longer confined to large-scale labs; the proliferation of efficient, open-source generators is democratizing the ability to create high-fidelity synthetic content on consumer-grade hardware. This renders existing face-centric and manipulation-based benchmarks obsolete. To address this urgent threat, we introduce SynthForensics, to the best of our knowledge the first human-centric benchmark for detecting purely synthetic video deepfakes. The benchmark comprises 6,815 unique videos from five architecturally distinct, state-of-the-art open-source T2V models. Its construction was underpinned by a meticulous two-stage, human-in-the-loop validation process to ensure high semantic and visual quality. Each video is provided in four versions (raw, lossless, light, and heavy compression) to enable real-world robustness testing. Experiments demonstrate that state-of-the-art detectors are fragile and generalize poorly on this new domain: we observe a mean performance drop of $29.19\%$ AUC, with some methods performing worse than random chance and top models losing over 30 AUC points under heavy compression. The paper further investigates training on SynthForensics as a means to mitigate these performance gaps, achieving robust generalization to unseen generators ($93.81\%$ AUC), though at the cost of reduced backward compatibility with traditional manipulation-based deepfakes. The complete dataset and all generation metadata, including the specific prompts and inference parameters for every video, will be made publicly available at [link anonymized for review].
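
To make the headline numbers concrete, the sketch below shows how a cross-domain AUC drop of the kind reported above can be computed. It is a minimal illustration, assuming a hypothetical detector interface (`detector.score`) and datasets packaged as (videos, labels) pairs; none of these names come from the paper or any released SynthForensics API.

```python
from sklearn.metrics import roc_auc_score

def auc(detector, videos, labels):
    """ROC AUC of one detector's scores (label 1 = synthetic)."""
    # Hypothetical API: higher score = more likely fake.
    scores = [detector.score(v) for v in videos]
    return roc_auc_score(labels, scores)

def mean_auc_drop(detectors, source_set, target_set):
    """Average per-detector AUC loss, in percentage points, when moving
    from a source domain (e.g. FF++) to the T2V domain (e.g. SynthForensics)."""
    drops = [auc(d, *source_set) - auc(d, *target_set) for d in detectors]
    return 100 * sum(drops) / len(drops)
```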


Key Contributions

  • SynthForensics: the first human-centric benchmark for purely synthetic video deepfake detection, comprising 6,815 videos from five distinct open-source T2V models with four compression variants each (a re-encoding sketch follows this list)
  • Two-stage human-in-the-loop validation pipeline ensuring high semantic and visual quality of synthetic videos paired with real source videos
  • Empirical evaluation revealing state-of-the-art detectors suffer a mean AUC drop of 29.19% on synthetic T2V deepfakes, with training on SynthForensics achieving 93.81% AUC generalization to unseen generators
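
The four per-video versions named in the first contribution can be reproduced with standard re-encoding. Below is a minimal sketch assuming H.264 via ffmpeg with illustrative CRF values; the paper's exact codec parameters are not stated here, so treat these settings as placeholders.

```python
import subprocess
from pathlib import Path

# Assumed H.264/CRF settings (illustrative only, not the paper's published parameters).
VARIANTS = {
    "lossless": ["-c:v", "libx264", "-crf", "0"],
    "light":    ["-c:v", "libx264", "-crf", "23"],
    "heavy":    ["-c:v", "libx264", "-crf", "40"],
}

def make_variants(raw_path: str, out_dir: str) -> None:
    """Re-encode one raw video into the three compressed variants."""
    src = Path(raw_path)
    for name, codec_args in VARIANTS.items():
        dst = Path(out_dir) / f"{src.stem}_{name}.mp4"
        subprocess.run(["ffmpeg", "-y", "-i", str(src), *codec_args, str(dst)], check=True)
```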

🛡️ Threat Analysis

Output Integrity Attack

Directly addresses AI-generated content detection: specifically, detecting purely synthetic video deepfakes produced by text-to-video models. The benchmark evaluates content authenticity and the ability to distinguish real from AI-generated video, which falls squarely under output integrity and provenance verification. The paper also proposes training strategies to improve detector generalization across unseen generators.
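
One natural way to operationalize "generalization across unseen generators" is a leave-one-generator-out protocol: train on four of the five T2V models and test on the held-out fifth. The paper's exact protocol is not detailed here, so the following is only a sketch with hypothetical `train_detector` and `auc` helpers (the latter as in the earlier AUC sketch).

```python
def leave_one_generator_out(generator_sets, train_detector, auc):
    """generator_sets maps generator name -> (videos, labels).
    Returns the held-out AUC for each of the five generators."""
    results = {}
    for held_out in generator_sets:
        # Train on all generators except the held-out one.
        train_split = {g: data for g, data in generator_sets.items() if g != held_out}
        detector = train_detector(train_split)  # hypothetical training helper
        videos, labels = generator_sets[held_out]
        results[held_out] = auc(detector, videos, labels)
    return results
```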


Details

Domains
vision, generative
Model Types
diffusion, transformer
Threat Tags
inference_time, digital
Datasets
FaceForensics++ (FF++), DeepFakeDetection (DFD), SynthForensics
Applications
video deepfake detection, synthetic media forensics, content authenticity