benchmark 2025

Video Reality Test: Can AI-Generated ASMR Videos Fool VLMs and Humans?

Jiaqi Wang 1, Weijia Wu 2, Yi Zhan 1, Rui Zhao 2, Ming Hu 1, James Cheng 1, Wei Liu 3, Philip Torr 4, Kevin Qinghong Lin 4

1 citation · 52 references · arXiv

Published on arXiv

2512.13281

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Best VGM (Veo3.1-Fast) is detected as fake only 12.54% of the time; best VLM reviewer (Gemini-2.5-Pro) achieves 68.44% overall detection accuracy, far below human expert accuracy of 89.11%, exposing a large gap in perceptual fidelity evaluation for current VLMs.

Video Reality Test (peer-review benchmark)

Novel technique introduced


Recent advances in video generation have produced vivid content that is often indistinguishable from real videos, making AI-generated video detection an emerging societal challenge. Prior AIGC detection benchmarks mostly evaluate video without audio, target broad narrative domains, and focus solely on classification. Yet it remains unclear whether state-of-the-art video generation models can produce immersive, audio-paired videos that reliably deceive humans and VLMs. To this end, we introduce Video Reality Test, an ASMR-sourced video benchmark suite for testing perceptual realism under tight audio-visual coupling, featuring the following dimensions: (i) Immersive ASMR video-audio sources. Built on carefully curated real ASMR videos, the benchmark targets fine-grained action-object interactions with diversity across objects, actions, and backgrounds. (ii) Peer-review evaluation. An adversarial creator-reviewer protocol where video generation models act as creators aiming to fool reviewers, while VLMs serve as reviewers seeking to identify fakeness. Our experimental findings show that the best creator, Veo3.1-Fast, fools most VLMs: the strongest reviewer (Gemini 2.5-Pro) achieves only 56% accuracy (random chance: 50%), far below that of human experts (81.25%). Adding audio improves real-fake discrimination, yet superficial cues such as watermarks can still significantly mislead models. These findings delineate the current boundary of video generation realism and expose limitations of VLMs in perceptual fidelity and audio-visual consistency. Our code is available at https://github.com/video-reality-test/video-reality-test.
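The adversarial creator-reviewer protocol described above can be sketched in a few lines. This is a minimal illustration of the idea, not the authors' implementation: the function names, the conditioning of fakes on real clips, and the toy creator/reviewer below are all assumptions.

```python
# Minimal sketch of the creator-reviewer ("peer review") protocol:
# creators produce fakes, a reviewer labels every clip real (True) or
# fake (False), and we score its overall detection accuracy.

def peer_review(real_videos, creators, reviewer):
    """Mix real clips with creator-generated fakes and score the reviewer."""
    pool = [(v, True) for v in real_videos]
    for creator in creators:
        # Each creator generates one fake counterpart per real clip
        # (illustrative pairing; the actual benchmark setup may differ).
        pool += [(creator(v), False) for v in real_videos]
    correct = sum(reviewer(clip) == is_real for clip, is_real in pool)
    return correct / len(pool)

# Toy stand-ins: the "creator" just tags clips as fake, and a naive
# reviewer calls everything real, so it catches none of the fakes.
reals = ["asmr_tapping.mp4", "asmr_whisper.mp4"]
fake_creator = lambda v: "fake:" + v
always_real = lambda clip: True
acc = peer_review(reals, [fake_creator], always_real)
print(acc)  # 2 of 4 clips correct -> 0.5, i.e. random-chance accuracy
```

A reviewer that always answers "real" lands exactly at chance on a balanced pool, which is why the paper's 50% baseline is the floor against which VLM accuracies are compared.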


Key Contributions

  • Video Reality Test benchmark of ASMR videos with tight audio-visual coupling for evaluating AI-generated video detection under challenging perceptual conditions
  • Adversarial peer-review evaluation protocol pairing VGMs as creators against VLMs as reviewers across multiple generation and detection settings
  • Empirical findings showing VLMs top out at 76.27% accuracy vs. 89.11% for human experts, and that superficial watermark cues — not genuine perceptual reasoning — drive much of VLM detection performance

🛡️ Threat Analysis

Output Integrity Attack

Directly addresses AI-generated content detection — evaluating whether VLMs can distinguish real from synthetic video outputs. The adversarial creator-reviewer protocol assesses output integrity and the limits of current forensic detection methods.
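The two headline numbers (a per-creator "detected as fake" rate and a reviewer's overall accuracy) come from different slices of the same confusion matrix. A hedged sketch of how such metrics decompose, with assumed definitions and toy data rather than the paper's evaluation code:

```python
# Illustrative metric decomposition (assumed definitions, not the
# paper's code): overall accuracy mixes the rate at which real clips
# are kept as real with the rate at which a creator's fakes are caught.

def detection_metrics(labels, preds):
    """labels/preds: True = real, False = fake."""
    real = [p for l, p in zip(labels, preds) if l]
    fake = [p for l, p in zip(labels, preds) if not l]
    real_acc = sum(real) / len(real)                # real clips labeled real
    fake_detect = sum(not p for p in fake) / len(fake)  # fakes flagged as fake
    overall = sum(l == p for l, p in zip(labels, preds)) / len(labels)
    return real_acc, fake_detect, overall

# Toy reviewer: perfect on real clips but catches only 1 of 4 fakes,
# so overall accuracy looks moderate while the fake-detection rate is low.
labels = [True] * 4 + [False] * 4
preds  = [True] * 4 + [False, True, True, True]
print(detection_metrics(labels, preds))  # (1.0, 0.25, 0.625)
```

This is why a strong creator can show a very low detected-as-fake rate even when the reviewer's overall accuracy stays well above chance: the real-clip side of the matrix props the average up.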


Details

Domains
vision · audio · multimodal
Model Types
vlm · diffusion · multimodal
Threat Tags
inference_time
Datasets
Video Reality Test (VRT) — curated ASMR videos
Applications
ai-generated video detection · deepfake detection · audio-visual content authenticity