SKeDA: A Generative Watermarking Framework for Text-to-Video Diffusion Models
Yang Yang 1, Xinze Zou 1, Zehua Ma 2, Han Fang 3, Weiming Zhang 2
Published on arXiv
2603.00194
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
SKeDA outperforms existing baselines in both fidelity and traceability under video-specific distortions, including inter-frame compression, frame deletion, and noise.
SKeDA
Novel technique introduced
The rise of text-to-video generation models has raised growing concerns over content authenticity, copyright protection, and malicious misuse. Watermarking serves as an effective mechanism for regulating such AI-generated content, where high fidelity and strong robustness are particularly critical. Recent generative image watermarking methods provide a promising foundation by leveraging watermark information and pseudo-random keys to control the initial sampling noise, enabling lossless embedding. However, directly extending these techniques to video introduces two key limitations: (1) existing designs implicitly rely on strict alignment between video frames and the frame-dependent pseudo-random binary sequences used for watermark encryption, and once this alignment is disrupted, subsequent watermark extraction becomes unreliable; and (2) video-specific distortions, such as inter-frame compression, significantly degrade watermark reliability. To address these issues, we propose SKeDA, a generative watermarking framework tailored for text-to-video diffusion models. SKeDA consists of two components: (1) Shuffle-Key-based Distribution-preserving Sampling (SKe), which employs a single base pseudo-random binary sequence for watermark encryption and derives frame-level encryption sequences through permutation; this design transforms watermark extraction from synchronization-sensitive sequence decoding into permutation-tolerant set-level aggregation, substantially improving robustness against frame reordering and loss; and (2) Differential Attention (DA), which computes inter-frame differences and dynamically adjusts attention weights during extraction, enhancing robustness against temporal distortions. Extensive experiments demonstrate that SKeDA both preserves high video generation quality and achieves strong watermark robustness.
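The SKe idea of deriving per-frame encryption sequences by permuting one shared base sequence can be sketched as follows. This is a minimal illustrative model, not the paper's construction: the XOR encryption, the `frame_key` shuffle derivation, and the mode-based aggregation in `extract` are all assumptions made for the sketch.

```python
# Illustrative sketch of Shuffle-Key-based sampling (SKe); the actual SKeDA
# embedding operates on diffusion sampling noise, not raw bit vectors.
import random
from collections import Counter

def prng_bits(seed, n):
    """Pseudo-random binary sequence from a seed (stand-in for the paper's PRNG)."""
    rng = random.Random(seed)
    return tuple(rng.randint(0, 1) for _ in range(n))

def frame_key(base, seed, idx):
    """Frame-level encryption sequence: a seeded permutation of the single base sequence."""
    rng = random.Random(f"{seed}:{idx}")
    perm = list(range(len(base)))
    rng.shuffle(perm)
    return tuple(base[p] for p in perm)

def embed(watermark, base, seed, idx):
    """XOR-encrypt the watermark with frame idx's derived sequence (assumed scheme)."""
    key = frame_key(base, seed, idx)
    return tuple(w ^ k for w, k in zip(watermark, key))

def extract(frames, base, seed, n_candidates):
    """Set-level aggregation: decrypt every received frame with every candidate
    frame key and take the most frequent decoding. Each surviving frame votes
    for the true watermark regardless of its position, so reordering or
    deleting frames does not break extraction."""
    counts = Counter()
    for frame in frames:
        for j in range(n_candidates):
            key = frame_key(base, seed, j)
            counts[tuple(f ^ k for f, k in zip(frame, key))] += 1
    return counts.most_common(1)[0][0]
```

In this toy model the correct decoding is recovered even after the frames are shuffled and some are dropped, which is the permutation-tolerant behavior the paper attributes to SKe.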
Key Contributions
- Shuffle-Key-based Distribution-preserving Sampling (SKe) that uses a single base pseudo-random binary sequence permuted per frame, converting extraction from synchronization-sensitive decoding to permutation-tolerant set-level aggregation
- Differential Attention (DA) module that computes inter-frame differences and dynamically reweights extraction attention to counteract temporal distortions like inter-frame compression
- First generative watermarking framework specifically designed for text-to-video diffusion models, robust against frame reordering, frame deletion, compression, and noise
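The DA contribution above can be illustrated with a toy stand-in (not the paper's network): treat each frame as a feature vector, score temporal distortion by mean absolute inter-frame difference, convert those scores into softmax attention weights so heavily distorted frames contribute less, and aggregate per-frame soft watermark estimates. The functions and the softmax weighting are assumptions for the sketch.

```python
# Toy sketch of Differential Attention (DA)-style weighting; the real module
# is a learned attention mechanism inside the watermark extractor.
import math

def frame_differences(frames):
    """Mean absolute inter-frame difference per frame (assumes >= 2 frames);
    frame 0 is compared against frame 1 so every frame gets a score."""
    diffs = []
    for t, frame in enumerate(frames):
        ref = frames[t - 1] if t > 0 else frames[1]
        diffs.append(sum(abs(a - b) for a, b in zip(frame, ref)) / len(frame))
    return diffs

def attention_weights(diffs, temperature=1.0):
    """Softmax over negated differences: temporally distorted frames
    (large inter-frame difference) receive lower attention weight."""
    logits = [-d / temperature for d in diffs]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(frame_estimates, weights):
    """Attention-weighted average of per-frame soft bit estimates, thresholded at 0.5."""
    n_bits = len(frame_estimates[0])
    soft = [sum(w * est[b] for w, est in zip(weights, frame_estimates))
            for b in range(n_bits)]
    return [1 if v >= 0.5 else 0 for v in soft]
```

A frame corrupted by, say, inter-frame compression stands out through its large difference score, gets down-weighted, and therefore cannot flip the aggregated watermark bits.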
🛡️ Threat Analysis
SKeDA watermarks AI-generated VIDEO OUTPUTS (not model weights) to trace content provenance, verify authenticity, and support copyright attribution — classic output integrity / content watermarking. The watermark is embedded in the generated content itself, not in model parameters.