Grab-3D: Detecting AI-Generated Videos from 3D Geometric Temporal Consistency

Recent advances in diffusion-based generation techniques enable AI models to produce highly realistic videos, heightening the need for reliable detection mechanisms. However, existing detection methods provide only limited exploration of the 3D geometric patterns present in generated videos. In this paper, we use vanishing points as an explicit representation of 3D geometry patterns, revealing fundamental discrepancies in geometric consistency between real and AI-generated videos. We introduce Grab-3D, a geometry-aware transformer framework for detecting AI-generated videos based on 3D geometric temporal consistency. To enable reliable evaluation, we construct an AI-generated video dataset of static scenes, allowing stable 3D geometric feature extraction. We propose a geometry-aware transformer equipped with geometric positional encoding, temporal-geometric attention, and an EMA-based geometric classifier head to explicitly inject 3D geometric awareness into temporal modeling. Experiments demonstrate that Grab-3D significantly outperforms state-of-the-art detectors, achieving robust cross-domain generalization to unseen generators.
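To make the vanishing-point idea concrete, here is a minimal sketch of how a vanishing point can be estimated from two image line segments and how its frame-to-frame movement can be summarized. It illustrates the general geometry only and is not the paper's feature extractor. The function names (`line_through`, `intersect`, `vp_drift`) and the drift score are assumptions introduced for this example.

```python
import math

def line_through(p, q):
    """Homogeneous line through two image points: cross product of [x, y, 1] vectors."""
    (x1, y1), (x2, y2) = p, q
    return (y1 - y2, x2 - x1, x1 * y2 - x2 * y1)

def intersect(l1, l2, eps=1e-9):
    """Intersection of two homogeneous lines, or None if they are (near-)parallel in the image."""
    a1, b1, c1 = l1
    a2, b2, c2 = l2
    w = a1 * b2 - a2 * b1  # third component of the cross product l1 x l2
    if abs(w) < eps:
        return None  # the vanishing point lies at infinity
    return ((b1 * c2 - b2 * c1) / w, (c1 * a2 - c2 * a1) / w)

def vp_drift(vps):
    """Hypothetical score: mean Euclidean displacement of the vanishing point between consecutive frames."""
    steps = [math.dist(a, b) for a, b in zip(vps, vps[1:])]
    return sum(steps) / len(steps) if steps else 0.0

# Two segments that are parallel in 3D project to image lines meeting at the vanishing point.
vp = intersect(line_through((0, 0), (2, 2)), line_through((0, 2), (2, 0)))
```

For a static scene, the vanishing point of a fixed set of parallel 3D lines should move only as the camera rotates, so large or erratic displacement is one crude hint of geometric inconsistency. A practical pipeline would fit the point robustly from many detected segments rather than from a single pair.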

Key Contributions

Identifies 3D geometric temporal consistency (via vanishing points) as a discriminative forensic signal distinguishing real from AI-generated videos
Introduces Grab-3D, a geometry-aware transformer with geometric positional encoding, temporal-geometric attention, and an EMA-based classifier head
Constructs a static-scene AI-generated video dataset enabling stable 3D geometric feature extraction and cross-domain generalization benchmarking
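This summary does not say how the EMA-based classifier head is built. As a generic illustration only, and assuming EMA stands for exponential moving average, the sketch below smooths a sequence of per-frame scores so that each output blends the current frame with the history before it. The function name `ema`, the smoothing factor `alpha`, and the idea of applying it to per-frame scores are assumptions for this example, not details taken from the paper.

```python
def ema(values, alpha=0.9):
    """Exponential moving average: s_t = alpha * s_{t-1} + (1 - alpha) * x_t, with s_0 = x_0."""
    smoothed = []
    state = None
    for x in values:
        # The first value initializes the state; later values are blended with it.
        state = x if state is None else alpha * state + (1 - alpha) * x
        smoothed.append(state)
    return smoothed

# Hypothetical per-frame scores; a larger alpha gives more weight to past frames.
scores = ema([1.0, 0.0, 1.0], alpha=0.5)
```

A larger `alpha` makes the output less sensitive to single-frame noise and slower to react to genuine changes.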

🛡️ Threat Analysis

Output Integrity Attack

Proposes a novel framework for detecting AI-generated videos. Detecting AI-generated content (here, videos produced by diffusion models) falls explicitly within ML09's scope of output integrity and content provenance. The novelty lies in the forensic insight (vanishing-point-based 3D geometric inconsistency) and the detection architecture, rather than in applying existing detectors to a new domain.

Details

Domains

vision, generative

Model Types

transformer, diffusion

Threat Tags

inference_time

Datasets

Custom static-scene AI-generated video dataset

Applications

Similar Papers

D3: Training-Free AI-Generated Video Detection Using Second-Order Features

SafeCtrl: Region-Based Safety Control for Text-to-Image Diffusion via Detect-Then-Suppress

Detecting Generated Images by Fitting Natural Image Distributions

CINEMAE: Leveraging Frozen Masked Autoencoders for Cross-Generator AI Image Detection

Exposing DeepFakes via Hyperspectral Domain Mapping

Training-free Detection of AI-generated images via Cropping Robustness

Rethinking the Use of Vision Transformers for AI-Generated Image Detection

Detecting AI-Generated Images via Distributional Deviations from Real Images