SWIFT: Sliding Window Reconstruction for Few-Shot Training-Free Generated Video Attribution
Chao Wang 1, Zijin Yang 1, Yaofei Wang 2, Yuang Qi 1, Weiming Zhang 1, Nenghai Yu 1, Kejiang Chen 1
Published on arXiv
2603.08536
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Achieves over 90% average source attribution accuracy across five state-of-the-art video generation models using as few as 20 samples, with zero-shot attribution possible for three models.
SWIFT
Novel technique introduced
Recent advances in video generation have been significant, leading to widespread application across many domains. However, concerns are mounting over the potential misuse of generated content, and tracing the origin of generated videos has become crucial for mitigating misuse and identifying responsible parties. Existing video attribution methods require additional operations or the training of source attribution models, which may degrade video quality or demand large numbers of training samples. To address these challenges, we define for the first time the "few-shot training-free generated video attribution" task and propose SWIFT, a method tightly integrated with the temporal characteristics of video. By leveraging the "Pixel Frames (many) to Latent Frame (one)" temporal mapping within each video chunk, SWIFT applies a fixed-length sliding window to perform two distinct reconstructions: one normal and one corrupted. The variation in loss between the two reconstructions is then used as an attribution signal. We conducted an extensive evaluation on five state-of-the-art (SOTA) video generation models. Experimental results show that SWIFT achieves over 90% average attribution accuracy with merely 20 video samples across all models, and even enables zero-shot attribution for HunyuanVideo, EasyAnimate, and Wan2.2. Our source code is available at https://github.com/wangchao0708/SWIFT.
Key Contributions
- First formal definition of the "few-shot training-free generated video attribution" task
- SWIFT method exploiting the "Pixel Frames → Latent Frame" temporal mapping within video chunks via a fixed-length sliding window performing normal vs. corrupted reconstructions, using the loss variation as an attribution signal
- Achieves >90% average attribution accuracy across 5 SOTA video generators using only 20 samples, with zero-shot capability for HunyuanVideo, EasyAnimate, and Wan2.2
🛡️ Threat Analysis
Addresses AI-generated content provenance and attribution — determining which generative model produced a given video. This is a novel forensic technique for content authenticity and traceability, squarely within ML09's scope of output integrity and AI-generated content detection/attribution.