benchmark 2025

DeepfakeBench-MM: A Comprehensive Benchmark for Multimodal Deepfake Detection

Kangran Zhao 1,2, Yupeng Chen 1,2, Xiaoyu Zhang 1,2, Yize Chen 1,2, Weinan Guan 1,2, Baicheng Chen 1,2, Chengzhe Sun 2, Soumyya Kanti Datta 2, Qingshan Liu 3, Siwei Lyu 2, Baoyuan Wu 1,2

1 citation · 90 references · arXiv


Published on arXiv · 2510.22622

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

DeepfakeBench-MM unifies evaluation of 11 multimodal deepfake detectors across 5 datasets; Mega-MMDF is one of the largest multimodal deepfake datasets with 1.1M forged samples from 21 forgery pipelines.

DeepfakeBench-MM

Novel technique introduced


The misuse of advanced generative AI models has resulted in the widespread proliferation of falsified data, particularly forged human-centric audiovisual content, which poses substantial societal risks (e.g., financial fraud and social instability). In response to this growing threat, several works have preliminarily explored countermeasures. However, the lack of sufficient and diverse training data, along with the absence of a standardized benchmark, hinders deeper exploration. To address this challenge, we first build Mega-MMDF, a large-scale, diverse, and high-quality dataset for multimodal deepfake detection. Specifically, we employ 21 forgery pipelines built from combinations of 10 audio forgery methods, 12 visual forgery methods, and 6 audio-driven face reenactment methods. Mega-MMDF currently contains 0.1 million real samples and 1.1 million forged samples, making it one of the largest and most diverse multimodal deepfake datasets, with plans for continuous expansion. Building on it, we present DeepfakeBench-MM, the first unified benchmark for multimodal deepfake detection. It establishes standardized protocols across the entire detection pipeline and serves as a versatile platform for evaluating existing methods as well as exploring novel approaches. DeepfakeBench-MM currently supports 5 datasets and 11 multimodal deepfake detectors. Furthermore, our comprehensive evaluations and in-depth analyses uncover several key findings from multiple perspectives (e.g., augmentation, stacked forgery). We believe that DeepfakeBench-MM, together with our large-scale Mega-MMDF, will serve as foundational infrastructure for advancing multimodal deepfake detection.


Key Contributions

  • Mega-MMDF: a large-scale multimodal deepfake dataset with 1.1M forged samples generated via 21 forgery pipelines (10 audio, 12 visual, 6 audio-driven face reenactment methods)
  • DeepfakeBench-MM: the first unified benchmark for multimodal deepfake detection, supporting 5 datasets and 11 detectors with standardized evaluation protocols
  • Comprehensive empirical analyses across multiple perspectives (augmentation, stacked forgery) revealing key insights for multimodal deepfake detection
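The "standardized evaluation protocols" the benchmark fixes typically include a shared scoring metric so detectors are comparable. As an illustrative sketch (not DeepfakeBench-MM's actual API, whose function names are unknown here), the snippet below computes video-level ROC-AUC from per-sample forgery scores using the rank-sum formulation, the kind of metric such benchmarks standardize:

```python
# Illustrative sketch only: a self-contained video-level ROC-AUC
# computation of the sort a unified deepfake benchmark would fix
# across all detectors so reported numbers are comparable.

def roc_auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney U) formulation.

    labels: 1 = forged, 0 = real; scores: higher = more likely forged.
    """
    # Rank all scores, assigning the average rank within tie blocks.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank over the tie block
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    # Mann-Whitney U for the positive class, normalized to [0, 1].
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Example: per-video forgery scores from a hypothetical detector.
labels = [0, 0, 1, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.9]
print(round(roc_auc(labels, scores), 4))  # → 0.8333
```

Pinning the metric (and its tie handling) at the benchmark level removes one common source of inconsistency when comparing detectors trained and evaluated in different codebases.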

🛡️ Threat Analysis

Output Integrity Attack

Deepfake detection is a core ML09 concern — the paper builds infrastructure (dataset + evaluation platform) for detecting AI-generated audiovisual content, directly advancing output integrity and content authenticity research.


Details

Domains
vision, audio, multimodal
Model Types
multimodal, generative
Threat Tags
digital
Datasets
Mega-MMDF, FakeAVCeleb, DFDC, FaceForensics++, LAV-DF
Applications
deepfake detection, audiovisual content authentication, multimodal forgery detection