SIGMark: Scalable In-Generation Watermark with Blind Extraction for Video Diffusion
Xinjie Zhu, Zijing Zhao, Hui Jin, Qingxiao Guo, Yilong Ma, Yunhao Wang, Xiaobing Guo, Weifeng Zhang
Published on arXiv
2603.02882
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
SIGMark achieves very high bit accuracy under both temporal and spatial disturbances, with minimal overhead, on modern video diffusion models that use causal 3D VAEs, and it outperforms non-blind in-generation baselines in scalability and robustness.
SIGMark
Novel technique introduced
Artificial Intelligence Generated Content (AIGC), particularly video generation with diffusion models, has advanced rapidly. Invisible watermarking is a key technology for protecting AI-generated videos and tracing harmful content, and thus plays a crucial role in AI safety. Unlike post-processing watermarks, which inevitably degrade video quality, recent studies have proposed distortion-free in-generation watermarking for video diffusion models. However, existing in-generation approaches are non-blind: they require maintaining all the message-key pairs and performing template-based matching during extraction, which incurs prohibitive computational costs at scale. Moreover, when applied to modern video diffusion models with causal 3D Variational Autoencoders (VAEs), their robustness against temporal disturbance becomes extremely weak. To overcome these challenges, we propose SIGMark, a Scalable In-Generation watermarking framework with blind extraction for video diffusion. To achieve blind extraction, we generate watermarked initial noise using a Global set of Frame-wise PseudoRandom Coding keys (GF-PRC), which reduces the cost of storing large-scale information while preserving the noise distribution and diversity needed for distortion-free watermarking. To enhance robustness, we further design a Segment Group-Ordering module (SGO) tailored to causal 3D VAEs, ensuring robust watermark inversion during extraction under temporal disturbance. Comprehensive experiments on modern diffusion models show that SIGMark achieves very high bit accuracy during extraction under both temporal and spatial disturbances with minimal overhead, demonstrating its scalability and robustness. Our project is available at https://jeremyzhao1998.github.io/SIGMark-release/.
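To make the idea of a keyed, distortion-free watermark in the initial noise more concrete, the following toy sketch may help. It is not the paper's GF-PRC construction: a hash-seeded keystream plus a repetition code stands in for a real pseudorandom code, and the names `keystream`, `embed`, `extract`, and `bit_accuracy` are hypothetical. The message bits are XORed with per-frame pseudorandom bits and stored in the signs of Gaussian samples, so each value still looks like standard normal noise as long as the keystream looks uniform. Extraction uses only the shared frame keys, with no per-video record. The step that recovers the initial noise from a video (VAE encoding and diffusion inversion) is replaced here by random sign flips, and frame dropping or reordering, which the SGO module targets, is not modeled.

```python
import hashlib
import random


def keystream(key, frame_idx, n):
    # Hypothetical helper: n pseudorandom bits derived from a key and a frame index.
    seed = hashlib.sha256(key + frame_idx.to_bytes(4, "big")).digest()
    rng = random.Random(seed)
    return [rng.getrandbits(1) for _ in range(n)]


def embed(message, frame_keys, n_latents, rng):
    # Build per-frame noise whose signs carry (message XOR keystream).
    noise = []
    for f, key in enumerate(frame_keys):
        ks = keystream(key, f, n_latents)
        frame = []
        for i in range(n_latents):
            bit = message[i % len(message)] ^ ks[i]  # repetition code over positions
            mag = abs(rng.gauss(0.0, 1.0))           # half-normal magnitude
            frame.append(mag if bit else -mag)       # sign encodes the masked bit
        noise.append(frame)
    return noise


def extract(noise, frame_keys, msg_len):
    # Blind decoding: needs only the shared frame keys, not the original message.
    votes = [0] * msg_len
    for f, key in enumerate(frame_keys):
        ks = keystream(key, f, len(noise[f]))
        for i, value in enumerate(noise[f]):
            bit = (1 if value > 0 else 0) ^ ks[i]
            votes[i % msg_len] += 1 if bit else -1   # majority vote per message bit
    return [1 if v > 0 else 0 for v in votes]


def bit_accuracy(sent, received):
    # Fraction of message bits recovered correctly.
    return sum(a == b for a, b in zip(sent, received)) / len(sent)


rng = random.Random(0)
message = [rng.getrandbits(1) for _ in range(16)]
frame_keys = [bytes([f]) * 16 for f in range(4)]  # toy global key set, one key per frame
noise = embed(message, frame_keys, 256, rng)

# Assumption: imperfect inversion is modeled as flipping the sign of about 20% of values.
noisy = [[-v if rng.random() < 0.2 else v for v in frame] for frame in noise]
recovered = extract(noisy, frame_keys, len(message))
print(bit_accuracy(message, recovered))
```

Each message bit receives many votes across positions and frames, which is why a moderate rate of sign errors can be tolerated in this toy setting; a real pseudorandom code would provide error correction in a more principled way.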
Key Contributions
- GF-PRC (Global set of Frame-wise PseudoRandom Coding keys) enabling blind watermark extraction without storing large-scale message-key pairs, while preserving distortion-free noise distribution
- SGO (Segment Group-Ordering) module tailored to causal 3D VAEs that ensures robust watermark inversion under temporal disturbances during extraction
- End-to-end SIGMark framework achieving high bit-accuracy under both temporal and spatial distortions with minimal computational overhead on modern video diffusion models
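The scalability argument behind blind extraction can be illustrated with a small sketch. It is only an assumption-laden toy, not the baselines' or SIGMark's actual procedure: `registry`, `nonblind_extract`, and `blind_extract` are hypothetical names, bit vectors stand in for inverted noise, and error correction is omitted. The non-blind path compares the observation against every stored template, so its cost grows with the number of watermarked videos issued, while the blind path decodes with one shared key regardless of how many videos exist.

```python
import random


def hamming(a, b):
    # Number of positions where two bit lists differ.
    return sum(x != y for x, y in zip(a, b))


def nonblind_extract(observed, registry):
    # Template matching: scan every stored template and return the closest ID.
    # Work is proportional to len(registry) times the template length.
    return min(registry, key=lambda msg_id: hamming(observed, registry[msg_id]))


def blind_extract(observed, key_bits):
    # Keyed decoding: unmask with one shared keystream.
    # Work is proportional to the template length only.
    return [o ^ k for o, k in zip(observed, key_bits)]


rng = random.Random(0)
length = 64

# Non-blind setting: one stored template per issued video.
registry = {i: [rng.getrandbits(1) for _ in range(length)] for i in range(1000)}
observed = list(registry[417])
for i in range(5):
    observed[i] ^= 1  # a few bit errors from imperfect recovery
matched_id = nonblind_extract(observed, registry)

# Blind setting: the message is masked with a shared key and needs no registry.
key_bits = [rng.getrandbits(1) for _ in range(length)]
message = [rng.getrandbits(1) for _ in range(length)]
masked = [m ^ k for m, k in zip(message, key_bits)]
decoded = blind_extract(masked, key_bits)
print(matched_id, decoded == message)
```

In this toy setup the registry holds 1000 templates, so the non-blind lookup performs 1000 Hamming comparisons per extraction, and that count scales linearly with the number of issued videos.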
🛡️ Threat Analysis
Watermarks are embedded in the video outputs of diffusion models (generated content) to trace provenance and authenticate AI-generated videos; this is content watermarking for output integrity, not model weight protection. The framework enables blind extraction without maintaining message-key pairs, directly addressing scalable content provenance at deployment.