benchmark 2025

MVAD : A Comprehensive Multimodal Video-Audio Dataset for AIGC Detection

Mengxue Hu 1, Yunfeng Diao 1, Changtao Miao 2, Jianshu Li 2, Zhe Li 2, Joey Tianyi Zhou 3

1 citation · 44 references · arXiv


Published on arXiv · 2512.00336

Output Integrity Attack

OWASP ML Top 10 (ML09)

Key Finding

MVAD is the first general-purpose benchmark for detecting AI-generated multimodal video-audio content, covering four multimodal data types and over twenty distinct generators across realistic and anime visual domains.

Novel technique introduced: MVAD


Abstract

The rapid advancement of AI-generated multimodal video-audio content has raised significant concerns regarding information security and content authenticity. Existing synthetic video datasets predominantly focus on the visual modality alone, while the few incorporating audio are largely confined to facial deepfakes, a limitation that fails to address the expanding landscape of general multimodal AI-generated content and substantially impedes the development of trustworthy detection systems. To bridge this critical gap, we introduce the Multimodal Video-Audio Dataset (MVAD), the first comprehensive dataset specifically designed for detecting AI-generated multimodal video-audio content. Our dataset exhibits three key characteristics: (1) genuine multimodality with samples generated according to three realistic video-audio forgery patterns; (2) high perceptual quality achieved through diverse state-of-the-art generative models; and (3) comprehensive diversity spanning realistic and anime visual styles, four content categories (humans, animals, objects, and scenes), and four video-audio multimodal data types. Our dataset will be available at https://github.com/HuMengXue0104/MVAD.


Key Contributions

  • First comprehensive multimodal video-audio dataset (MVAD) specifically designed for general-purpose AIGC detection, covering three realistic forgery patterns (fake video-fake audio, fake video-real audio, real video-fake audio)
  • Dataset spans 20+ state-of-the-art generators, two visual styles (realistic and anime), and four content categories (humans, animals, objects, scenes)
  • Addresses the critical gap left by existing datasets that are limited to facial deepfakes only, enabling research on broader AI-generated multimodal content
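The three forgery patterns above imply per-modality authenticity labels (video and audio can each be real or fake, with the all-real combination excluded). A minimal sketch of such a label taxonomy, assuming hypothetical names since the paper's actual schema is not shown here:

```python
from enum import Enum

# Hypothetical taxonomy sketch based on the dataset description above;
# enum and function names are illustrative, not MVAD's actual schema.
class ForgeryPattern(Enum):
    FAKE_VIDEO_FAKE_AUDIO = "fv-fa"
    FAKE_VIDEO_REAL_AUDIO = "fv-ra"
    REAL_VIDEO_FAKE_AUDIO = "rv-fa"

class VisualStyle(Enum):
    REALISTIC = "realistic"
    ANIME = "anime"

class ContentCategory(Enum):
    HUMANS = "humans"
    ANIMALS = "animals"
    OBJECTS = "objects"
    SCENES = "scenes"

def modality_labels(pattern: ForgeryPattern) -> dict:
    """Map a forgery pattern to per-modality authenticity labels."""
    fake_video = pattern in (ForgeryPattern.FAKE_VIDEO_FAKE_AUDIO,
                             ForgeryPattern.FAKE_VIDEO_REAL_AUDIO)
    fake_audio = pattern in (ForgeryPattern.FAKE_VIDEO_FAKE_AUDIO,
                             ForgeryPattern.REAL_VIDEO_FAKE_AUDIO)
    return {"video_fake": fake_video, "audio_fake": fake_audio}

# Example: a "fake video, real audio" clip has only its visual track synthesized.
print(modality_labels(ForgeryPattern.FAKE_VIDEO_REAL_AUDIO))
# → {'video_fake': True, 'audio_fake': False}
```

Decomposing the pattern label this way lets a detector be trained and evaluated per modality rather than on a single binary real/fake target.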

🛡️ Threat Analysis

Output Integrity Attack

The dataset is explicitly designed to support detection of AI-generated video-audio content (deepfakes and synthetic media), the core concern of ML09 output integrity and of AIGC detection research. Its three forgery patterns simulate real-world content authenticity threats.


Details

Domains
vision, audio, multimodal, generative
Model Types
diffusion, GAN, multimodal
Threat Tags
inference_time, digital
Datasets
MVAD
Applications
AI-generated content detection, deepfake detection, multimodal forgery detection