
EA-Swin: An Embedding-Agnostic Swin Transformer for AI-Generated Video Detection

Hung Mai 1,2, Loi Dinh 3, Duc Hai Nguyen 1,4, Dat Do 5, Luong Doan 1, Khanh Nguyen Quoc 1,4, Huan Vu 2, Naeem Ul Islam 6, Tuan Do 1

0 citations · 85 references · arXiv (Cornell University)


Published on arXiv

arXiv:2602.17260

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

EA-Swin achieves 0.97–0.99 accuracy across major AI video generators, outperforming prior SoTA methods by 5–20% while maintaining strong cross-distribution generalization to unseen generators.

EA-Swin

Novel technique introduced


Abstract

Recent advances in foundation video generators such as Sora2, Veo3, and other commercial systems have produced highly realistic synthetic videos, exposing the limitations of existing detection methods that rely on shallow embedding trajectories, image-based adaptation, or computationally heavy MLLMs. We propose EA-Swin, an Embedding-Agnostic Swin Transformer that models spatiotemporal dependencies directly on pretrained video embeddings via a factorized windowed attention design, making it compatible with generic ViT-style patch-based encoders. Alongside the model, we construct EA-Video, a benchmark of 130K videos that integrates newly collected samples with curated existing datasets, covering diverse commercial and open-source generators and including unseen-generator splits for rigorous cross-distribution evaluation. Extensive experiments show that EA-Swin achieves 0.97–0.99 accuracy across major generators, outperforming prior SoTA methods (typically 0.8–0.9) by a margin of 5–20%, while maintaining strong generalization to unseen distributions, establishing a scalable and robust solution for modern AI-generated video detection.
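The abstract describes factorized windowed attention applied directly to pretrained ViT-style patch embeddings: spatial attention restricted to local windows within each frame, followed by temporal attention across frames. The paper's exact block design is not given in this summary, so the sketch below is an illustrative interpretation only; all shapes, layer names, and the window size are assumptions, not EA-Swin's published architecture.

```python
import torch
import torch.nn as nn

class FactorizedWindowAttention(nn.Module):
    """Hypothetical sketch of factorized windowed attention over
    pretrained patch embeddings (shapes/names are assumptions).

    Input x: (B, T, H, W, D) -- T frames, an H x W grid of D-dim
    patch embeddings from a frozen ViT-style encoder.
    """
    def __init__(self, dim, num_heads=4, window=4):
        super().__init__()
        self.window = window
        self.spatial = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        B, T, H, W, D = x.shape
        w = self.window
        # Spatial attention inside non-overlapping w x w windows per frame.
        xs = x.view(B, T, H // w, w, W // w, w, D)
        xs = xs.permute(0, 1, 2, 4, 3, 5, 6).reshape(-1, w * w, D)
        h = self.norm1(xs)
        xs = xs + self.spatial(h, h, h, need_weights=False)[0]
        xs = xs.view(B, T, H // w, W // w, w, w, D)
        xs = xs.permute(0, 1, 2, 4, 3, 5, 6).reshape(B, T, H, W, D)
        # Temporal attention across the T frames at each patch location.
        xt = xs.permute(0, 2, 3, 1, 4).reshape(B * H * W, T, D)
        h = self.norm2(xt)
        xt = xt + self.temporal(h, h, h, need_weights=False)[0]
        return xt.view(B, H, W, T, D).permute(0, 3, 1, 2, 4)

# Usage: 2 clips, 8 frames, 8x8 patch grid, 32-dim embeddings.
blk = FactorizedWindowAttention(dim=32, num_heads=4, window=4)
emb = torch.randn(2, 8, 8, 8, 32)
out = blk(emb)
print(tuple(out.shape))  # (2, 8, 8, 8, 32)
```

Factorizing attention this way keeps cost linear in the number of windows rather than quadratic in the full spatiotemporal token count, which is what makes windowed designs practical on long clips.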


Key Contributions

  • EA-Swin: an embedding-agnostic Swin Transformer with factorized windowed attention that models spatiotemporal dependencies on pretrained ViT-style video embeddings for robust synthetic video detection
  • EA-Video dataset: 130K-video benchmark spanning diverse commercial and open-source generators (including Sora2 and Veo3) with unseen-generator splits for cross-distribution evaluation
  • 5–20% accuracy improvement over prior SoTA (0.97–0.99 vs. 0.8–0.9) while generalizing to unseen generator distributions

🛡️ Threat Analysis

Output Integrity Attack

EA-Swin is a detection system for AI-generated (synthetic) video content, directly addressing output integrity and provenance authentication. Determining whether a video was produced by AI generators (Sora2, Veo3, diffusion models) is a canonical AI-generated-content detection task under OWASP ML09.


Details

Domains
vision, generative
Model Types
transformer
Threat Tags
inference_time
Datasets
EA-Video (130K videos, newly constructed), GenVidBench, VidProm
Applications
ai-generated video detection, synthetic video detection, deepfake video detection