EA-Swin: An Embedding-Agnostic Swin Transformer for AI-Generated Video Detection
Hung Mai 1,2, Loi Dinh 3, Duc Hai Nguyen 1,4, Dat Do 5, Luong Doan 1, Khanh Nguyen Quoc 1,4, Huan Vu 2, Naeem Ul Islam 6, Tuan Do 1
1 N2TP Technology Solution JSC
2 College of Technology, National Economics University
3 University of Science, Vietnam National University
Published on arXiv
2602.17260
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
EA-Swin achieves 0.97–0.99 accuracy across major AI video generators, outperforming prior SoTA methods by 5–20% while maintaining strong cross-distribution generalization to unseen generators.
EA-Swin
Novel technique introduced
Recent advances in foundation video generators such as Sora2, Veo3, and other commercial systems have produced highly realistic synthetic videos, exposing the limitations of existing detection methods that rely on shallow embedding trajectories, image-based adaptation, or computationally heavy MLLMs. We propose EA-Swin, an Embedding-Agnostic Swin Transformer that models spatiotemporal dependencies directly on pretrained video embeddings via a factorized windowed attention design, making it compatible with generic ViT-style patch-based encoders. Alongside the model, we construct EA-Video, a benchmark dataset of 130K videos that integrates newly collected samples with curated existing datasets, covering diverse commercial and open-source generators and including unseen-generator splits for rigorous cross-distribution evaluation. Extensive experiments show that EA-Swin achieves 0.97–0.99 accuracy across major generators, outperforming prior SoTA methods (typically 0.8–0.9) by 5–20%. It also maintains strong generalization to unseen distributions, establishing a scalable and robust solution for modern AI-generated video detection.
Key Contributions
- EA-Swin: an embedding-agnostic Swin Transformer with factorized windowed attention that models spatiotemporal dependencies on pretrained ViT-style video embeddings for robust synthetic video detection
- EA-Video dataset: 130K-video benchmark spanning diverse commercial and open-source generators (including Sora2 and Veo3) with unseen-generator splits for cross-distribution evaluation
- 5–20% accuracy improvement over prior SoTA (0.97–0.99 vs. 0.8–0.9) while generalizing to unseen generator distributions
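The factorized windowed attention named above can be illustrated with a minimal sketch: spatial attention within non-overlapping windows of each frame's patch tokens, followed by temporal attention across frames at each spatial position. The paper's exact layer design (window sizes, heads, projections) is not reproduced here; the function and shapes below are illustrative assumptions, and the input stands in for generic ViT-style patch embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the trailing (tokens, dim) axes.
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ v

def factorized_windowed_attention(x, window=4):
    """x: (T, N, D) pretrained patch embeddings, T frames of N tokens each.
    Stage 1: windowed spatial attention within each frame.
    Stage 2: temporal attention across frames at each spatial position.
    (Hypothetical sketch; not the authors' exact layer.)"""
    T, N, D = x.shape
    assert N % window == 0, "token count must divide into windows"
    # Spatial stage: split each frame's tokens into non-overlapping windows,
    # attend only within each window (the Swin-style locality constraint).
    xs = x.reshape(T, N // window, window, D)
    xs = attention(xs, xs, xs).reshape(T, N, D)
    # Temporal stage: for each spatial token index, attend across the T frames.
    xt = np.transpose(xs, (1, 0, 2))           # (N, T, D)
    xt = attention(xt, xt, xt)
    return np.transpose(xt, (1, 0, 2))         # back to (T, N, D)

# Toy input: 8 frames, 16 patch tokens per frame, 32-dim embeddings.
emb = np.random.default_rng(0).normal(size=(8, 16, 32))
out = factorized_windowed_attention(emb)
print(out.shape)  # (8, 16, 32)
```

Factorizing attention this way keeps cost linear in the number of windows rather than quadratic in the full spatiotemporal token count, which is what makes the design practical on long clips of dense patch embeddings.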
🛡️ Threat Analysis
EA-Swin is a detection system for AI-generated (synthetic) video content, directly addressing output integrity and provenance authentication. Detecting whether a video was produced by AI generators (Sora2, Veo3, diffusion models) is a canonical output-integrity (ML09) AI-generated content detection task.