ATSS: Detecting AI-Generated Videos via Anomalous Temporal Self-Similarity
Hang Wang 1,2, Chao Shen 1, Lei Zhang 2, Zhi-Qi Cheng 3
Published on arXiv
2604.04029
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Significantly outperforms state-of-the-art methods across AP, AUC, and ACC metrics on four large-scale benchmarks with superior cross-model generalization
ATSS
Novel technique introduced
AI-generated videos (AIGVs) have achieved unprecedented photorealism, posing severe threats to digital forensics. Existing AIGV detectors focus mainly on localized artifacts or short-term temporal inconsistencies, and thus often fail to capture the underlying generative logic governing global temporal evolution, limiting detection performance. In this paper, we identify a distinctive fingerprint in AIGVs, termed anomalous temporal self-similarity (ATSS). Unlike real videos, which exhibit stochastic natural dynamics, AIGVs follow deterministic anchor-driven trajectories (e.g., text or image prompts), inducing unnaturally repetitive correlations across visual and semantic domains. To exploit this, we propose ATSS, a multimodal detection framework built on a triple-similarity representation and a cross-attentive fusion mechanism. Specifically, ATSS reconstructs semantic trajectories by leveraging frame-wise descriptions to construct visual, textual, and cross-modal similarity matrices, which jointly quantify the inherent temporal anomalies. These matrices are encoded by dedicated Transformer encoders and integrated via a bidirectional cross-attentive fusion module to effectively model intra- and inter-modal dynamics. Extensive experiments on four large-scale benchmarks (GenVideo, EvalCrafter, VideoPhy, and VidProM) demonstrate that ATSS significantly outperforms state-of-the-art methods in terms of AP, AUC, and ACC, exhibiting superior generalization across diverse video generation models. Code and models of ATSS will be released at https://github.com/hwang-cs-ime/ATSS.
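The triple-similarity representation can be sketched concretely. Below is a minimal, hedged illustration (not the authors' released code): assuming frame embeddings and frame-wise caption embeddings have already been extracted by some encoder, the three T×T matrices are just cosine-similarity grams, where unnaturally repetitive off-diagonal structure is the temporal anomaly the paper exploits.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Unit-normalize rows so that dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def triple_similarity(visual_emb, text_emb):
    """Build the three T x T temporal similarity matrices.

    visual_emb: (T, d) per-frame visual embeddings (encoder is assumed,
                e.g., any pretrained vision backbone)
    text_emb:   (T, d) embeddings of frame-wise textual descriptions
    Returns (visual self-sim, textual self-sim, cross-modal sim).
    """
    v = l2_normalize(visual_emb)
    t = l2_normalize(text_emb)
    sim_vv = v @ v.T   # visual self-similarity across time
    sim_tt = t @ t.T   # textual self-similarity across time
    sim_vt = v @ t.T   # cross-modal visual-textual similarity
    return sim_vv, sim_tt, sim_vt
```

The function names and shapes here are illustrative assumptions; in the paper these matrices are subsequently fed to dedicated Transformer encoders.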
Key Contributions
- Identifies anomalous temporal self-similarity (ATSS) as a forensic fingerprint distinguishing AI-generated from real videos
- Proposes triple-similarity representation (visual, textual, cross-modal) with cross-attentive fusion for multimodal temporal anomaly detection
- Achieves SOTA performance on four benchmarks (GenVideo, EvalCrafter, VideoPhy, VidProM) with superior generalization
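The bidirectional cross-attentive fusion named above can be illustrated with a minimal numpy sketch, under the assumption (ours, not the paper's) of single-head, unprojected attention: each modality's token sequence attends to the other, and the two attended outputs are concatenated.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key_value, d_k):
    """Single-head attention: query tokens attend over key_value tokens."""
    attn = softmax(query @ key_value.T / np.sqrt(d_k))
    return attn @ key_value

def bidirectional_fuse(feat_a, feat_b):
    """Simplified stand-in for the paper's fusion module: fuse two (T, d)
    token sequences by attending in both directions and concatenating."""
    d = feat_a.shape[-1]
    a2b = cross_attention(feat_a, feat_b, d)  # modality A attends to B
    b2a = cross_attention(feat_b, feat_a, d)  # modality B attends to A
    return np.concatenate([a2b, b2a], axis=-1)  # (T, 2d) fused features
```

A real implementation would add learned query/key/value projections, multiple heads, and residual connections; this sketch only conveys the bidirectional information flow.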
🛡️ Threat Analysis
The paper focuses on detecting AI-generated video content to verify authenticity and provenance, which falls under output integrity. The detector distinguishes synthetic videos from real ones, a classic deepfake/AI-content detection task.