EA-Swin: An Embedding-Agnostic Swin Transformer for AI-Generated Video Detection
Hung Mai 1,2, Loi Dinh 3, Duc Hai Nguyen 1,4, Dat Do 5, Luong Doan 1, Khanh Nguyen Quoc 1,4, Huan Vu 2, Naeem Ul Islam 6, Tuan Do 1
1 N2TP Technology Solution JSC
2 College of Technology, National Economics University
3 University of Science, Vietnam National University
Published on arXiv
2602.17260
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
EA-Swin achieves 0.97–0.99 accuracy across major AI video generators, outperforming prior SoTA methods by 5–20% while maintaining strong cross-distribution generalization to unseen generators.
EA-Swin
Novel technique introduced
Recent advances in foundation video generators such as Sora2, Veo3, and other commercial systems have produced highly realistic synthetic videos, exposing the limitations of existing detection methods that rely on shallow embedding trajectories, image-based adaptation, or computationally heavy MLLMs. We propose EA-Swin, an Embedding-Agnostic Swin Transformer that models spatiotemporal dependencies directly on pretrained video embeddings via a factorized windowed attention design, making it compatible with generic ViT-style patch-based encoders. Alongside the model, we construct EA-Video, a benchmark dataset of 130K videos that integrates newly collected samples with curated existing datasets, covering diverse commercial and open-source generators and including unseen-generator splits for rigorous cross-distribution evaluation. Extensive experiments show that EA-Swin achieves 0.97–0.99 accuracy across major generators, outperforming prior SoTA methods (typically 0.8–0.9) by 5–20%. It also maintains strong generalization to unseen distributions, establishing a scalable and robust solution for modern AI-generated video detection.
Key Contributions
- EA-Swin: an embedding-agnostic Swin Transformer with factorized windowed attention that models spatiotemporal dependencies on pretrained ViT-style video embeddings for robust synthetic video detection
- EA-Video dataset: 130K-video benchmark spanning diverse commercial and open-source generators (including Sora2 and Veo3) with unseen-generator splits for cross-distribution evaluation
- 5–20% accuracy improvement over prior SoTA (0.97–0.99 vs. 0.8–0.9) while generalizing to unseen generator distributions
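The factorized windowed attention named above can be illustrated with a minimal sketch: spatial attention within non-overlapping windows of each frame's patch tokens, followed by temporal attention across frames at each spatial position. The paper's exact layer design (window sizes, heads, projections) is not reproduced here; the function and shapes below are illustrative assumptions, and the input stands in for generic ViT-style patch embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the trailing (tokens, dim) axes.
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ v

def factorized_windowed_attention(x, window=4):
    """x: (T, N, D) pretrained patch embeddings, T frames of N tokens each.
    Stage 1: windowed spatial attention within each frame.
    Stage 2: temporal attention across frames at each spatial position.
    (Hypothetical sketch; not the authors' exact layer.)"""
    T, N, D = x.shape
    assert N % window == 0, "token count must divide into windows"
    # Spatial stage: split each frame's tokens into non-overlapping windows,
    # attend only within each window (the Swin-style locality constraint).
    xs = x.reshape(T, N // window, window, D)
    xs = attention(xs, xs, xs).reshape(T, N, D)
    # Temporal stage: for each spatial token index, attend across the T frames.
    xt = np.transpose(xs, (1, 0, 2))           # (N, T, D)
    xt = attention(xt, xt, xt)
    return np.transpose(xt, (1, 0, 2))         # back to (T, N, D)

# Toy input: 8 frames, 16 patch tokens per frame, 32-dim embeddings.
emb = np.random.default_rng(0).normal(size=(8, 16, 32))
out = factorized_windowed_attention(emb)
print(out.shape)  # (8, 16, 32)
```

Factorizing attention this way keeps cost linear in the number of windows rather than quadratic in the full spatiotemporal token count, which is what makes the design practical on long clips of dense patch embeddings.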
🛡️ Threat Analysis
EA-Swin is a detection system for AI-generated (synthetic) video content, directly addressing output integrity and provenance authentication. Detecting whether a video was produced by AI generators (Sora2, Veo3, diffusion models) is a canonical output-integrity (ML09) AI-generated content detection task.