SKeDA: A Generative Watermarking Framework for Text-to-Video Diffusion Models
Yang Yang 1, Xinze Zou 1, Zehua Ma 2, Han Fang 3, Weiming Zhang 2
Published on arXiv
2603.00194
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
SKeDA outperforms existing baselines in both fidelity and traceability under video-specific distortions, including inter-frame compression, frame deletion, and noise.
SKeDA
Novel technique introduced
The rise of text-to-video generation models has raised growing concerns over content authenticity, copyright protection, and malicious misuse. Watermarking serves as an effective mechanism for regulating such AI-generated content, where high fidelity and strong robustness are particularly critical. Recent generative image watermarking methods provide a promising foundation by leveraging watermark information and pseudo-random keys to control the initial sampling noise, enabling lossless embedding. However, directly extending these techniques to video introduces two key limitations: (1) existing designs implicitly rely on strict alignment between video frames and the frame-dependent pseudo-random binary sequences used for watermark encryption, and once this alignment is disrupted, subsequent watermark extraction becomes unreliable; and (2) video-specific distortions, such as inter-frame compression, significantly degrade watermark reliability. To address these issues, we propose SKeDA, a generative watermarking framework tailored for text-to-video diffusion models. SKeDA consists of two components: (1) Shuffle-Key-based Distribution-preserving Sampling (SKe), which employs a single base pseudo-random binary sequence for watermark encryption and derives frame-level encryption sequences through permutation; this design transforms watermark extraction from synchronization-sensitive sequence decoding into permutation-tolerant set-level aggregation, substantially improving robustness against frame reordering and loss; and (2) Differential Attention (DA), which computes inter-frame differences and dynamically adjusts attention weights during extraction, enhancing robustness against temporal distortions. Extensive experiments demonstrate that SKeDA both preserves high video generation quality and achieves strong watermark robustness.
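The SKe idea of deriving per-frame encryption sequences by permuting one shared base sequence can be sketched as follows. This is a minimal illustrative model, not the paper's construction: the XOR encryption, the `frame_key` shuffle derivation, and the mode-based aggregation in `extract` are all assumptions made for the sketch.

```python
# Illustrative sketch of Shuffle-Key-based sampling (SKe); the actual SKeDA
# embedding operates on diffusion sampling noise, not raw bit vectors.
import random
from collections import Counter

def prng_bits(seed, n):
    """Pseudo-random binary sequence from a seed (stand-in for the paper's PRNG)."""
    rng = random.Random(seed)
    return tuple(rng.randint(0, 1) for _ in range(n))

def frame_key(base, seed, idx):
    """Frame-level encryption sequence: a seeded permutation of the single base sequence."""
    rng = random.Random(f"{seed}:{idx}")
    perm = list(range(len(base)))
    rng.shuffle(perm)
    return tuple(base[p] for p in perm)

def embed(watermark, base, seed, idx):
    """XOR-encrypt the watermark with frame idx's derived sequence (assumed scheme)."""
    key = frame_key(base, seed, idx)
    return tuple(w ^ k for w, k in zip(watermark, key))

def extract(frames, base, seed, n_candidates):
    """Set-level aggregation: decrypt every received frame with every candidate
    frame key and take the most frequent decoding. Each surviving frame votes
    for the true watermark regardless of its position, so reordering or
    deleting frames does not break extraction."""
    counts = Counter()
    for frame in frames:
        for j in range(n_candidates):
            key = frame_key(base, seed, j)
            counts[tuple(f ^ k for f, k in zip(frame, key))] += 1
    return counts.most_common(1)[0][0]
```

In this toy model the correct decoding is recovered even after the frames are shuffled and some are dropped, which is the permutation-tolerant behavior the paper attributes to SKe.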
Key Contributions
- Shuffle-Key-based Distribution-preserving Sampling (SKe) that uses a single base pseudo-random binary sequence permuted per frame, converting extraction from synchronization-sensitive decoding to permutation-tolerant set-level aggregation
- Differential Attention (DA) module that computes inter-frame differences and dynamically reweights extraction attention to counteract temporal distortions like inter-frame compression
- First generative watermarking framework specifically designed for text-to-video diffusion models, robust against frame reordering, frame deletion, compression, and noise
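The DA contribution above can be illustrated with a toy stand-in (not the paper's network): treat each frame as a feature vector, score temporal distortion by mean absolute inter-frame difference, convert those scores into softmax attention weights so heavily distorted frames contribute less, and aggregate per-frame soft watermark estimates. The functions and the softmax weighting are assumptions for the sketch.

```python
# Toy sketch of Differential Attention (DA)-style weighting; the real module
# is a learned attention mechanism inside the watermark extractor.
import math

def frame_differences(frames):
    """Mean absolute inter-frame difference per frame (assumes >= 2 frames);
    frame 0 is compared against frame 1 so every frame gets a score."""
    diffs = []
    for t, frame in enumerate(frames):
        ref = frames[t - 1] if t > 0 else frames[1]
        diffs.append(sum(abs(a - b) for a, b in zip(frame, ref)) / len(frame))
    return diffs

def attention_weights(diffs, temperature=1.0):
    """Softmax over negated differences: temporally distorted frames
    (large inter-frame difference) receive lower attention weight."""
    logits = [-d / temperature for d in diffs]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(frame_estimates, weights):
    """Attention-weighted average of per-frame soft bit estimates, thresholded at 0.5."""
    n_bits = len(frame_estimates[0])
    soft = [sum(w * est[b] for w, est in zip(weights, frame_estimates))
            for b in range(n_bits)]
    return [1 if v >= 0.5 else 0 for v in soft]
```

A frame corrupted by, say, inter-frame compression stands out through its large difference score, gets down-weighted, and therefore cannot flip the aggregated watermark bits.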
🛡️ Threat Analysis
SKeDA watermarks AI-generated VIDEO OUTPUTS (not model weights) to trace content provenance, verify authenticity, and support copyright attribution — classic output integrity / content watermarking. The watermark is embedded in the generated content itself, not in model parameters.