WMVLM: Evaluating Diffusion Model Image Watermarking via Vision-Language Models
Zijin Yang, Yu Sun, Kejiang Chen, Jiawei Zhao, Jun Jiang, Weiming Zhang, Nenghai Yu
Published on arXiv
arXiv:2601.21610
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
WMVLM outperforms state-of-the-art VLMs on watermark evaluation with strong generalization across datasets, diffusion models, and watermarking methods
WMVLM
Novel technique introduced
Digital watermarking is essential for securing images generated by diffusion models. Accurate watermark evaluation is critical for algorithm development, yet existing methods have significant limitations: they lack a unified framework for both residual and semantic watermarks, provide results without interpretability, neglect comprehensive security considerations, and often apply metrics ill-suited to semantic watermarks. To address these gaps, we propose WMVLM, the first unified and interpretable evaluation framework for diffusion model image watermarking via vision-language models (VLMs). We redefine quality and security metrics for each watermark type: residual watermarks are evaluated by artifact strength and erasure resistance, while semantic watermarks are assessed through latent distribution shifts. Moreover, we introduce a three-stage training strategy that progressively equips the model for classification, scoring, and interpretable text generation. Experiments show that WMVLM outperforms state-of-the-art VLMs, with strong generalization across datasets, diffusion models, and watermarking methods.
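The "latent distribution shift" idea for semantic watermarks can be sketched as a distance between the distribution of (inverted) latents and the diffusion model's standard Gaussian prior. The diagonal-Gaussian assumption and the Fréchet-style score below are illustrative assumptions, not the paper's exact metric:

```python
import random
import statistics

def gaussian_shift_score(latents, ref_mean=0.0, ref_std=1.0):
    """1-D Frechet (2-Wasserstein) distance between the empirical latent
    distribution, treated as a Gaussian, and the reference prior
    N(ref_mean, ref_std^2). A hypothetical proxy for a latent-shift metric."""
    mu = statistics.fmean(latents)
    sigma = statistics.pstdev(latents)
    return (mu - ref_mean) ** 2 + (sigma - ref_std) ** 2

rng = random.Random(0)
# Unwatermarked latents follow the model's N(0, 1) prior.
clean = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
# A semantic watermark perturbs the latent prior (toy shift in mean and scale).
marked = [rng.gauss(0.3, 1.1) for _ in range(10_000)]

print(f"clean shift:  {gaussian_shift_score(clean):.4f}")
print(f"marked shift: {gaussian_shift_score(marked):.4f}")
```

A larger score indicates a stronger departure from the prior, which is the signal a semantic-watermark evaluator would score.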
Key Contributions
- First unified evaluation framework (WMVLM) covering both residual and semantic watermark types for diffusion model outputs, providing interpretable text generation alongside classification and scoring
- Redefined quality and security metrics per watermark type: artifact strength and erasure resistance for residual watermarks; latent distribution shifts for semantic watermarks
- Three-stage progressive training strategy enabling a VLM to perform classification, scoring, and natural-language interpretation of watermark quality and security
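The "erasure resistance" metric for residual watermarks can be illustrated with a toy sign-embedding watermark subjected to noise attacks of increasing strength; the embedding scheme, attack, and payload size below are illustrative assumptions, not the paper's method:

```python
import random

def embed(signal, bits, strength=0.2):
    """Toy residual watermark: nudge each coefficient up or down per bit."""
    return [s + (strength if b else -strength) for s, b in zip(signal, bits)]

def extract(attacked, signal):
    """Recover bits from the sign of the residual against the original."""
    return [1 if a - s > 0 else 0 for a, s in zip(attacked, signal)]

def bit_accuracy(bits, recovered):
    return sum(b == r for b, r in zip(bits, recovered)) / len(bits)

rng = random.Random(42)
signal = [rng.random() for _ in range(1_000)]     # stand-in for image coefficients
bits = [rng.randint(0, 1) for _ in range(1_000)]  # watermark payload
marked = embed(signal, bits)

# Erasure attack: additive Gaussian noise of increasing strength.
accs = {}
for noise in (0.05, 0.2, 0.5):
    attacked = [m + rng.gauss(0, noise) for m in marked]
    accs[noise] = bit_accuracy(bits, extract(attacked, signal))
    print(f"noise sigma {noise}: bit accuracy {accs[noise]:.2f}")
```

Erasure resistance is then the bit accuracy retained as attack strength grows; a robust scheme degrades slowly, while a fragile one collapses toward the 0.5 chance level.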
🛡️ Threat Analysis
The paper's primary contribution is a framework for evaluating watermarking schemes applied to diffusion model image outputs (content provenance/authenticity). It defines quality and security metrics for both residual and semantic watermarks, including erasure resistance, directly targeting the output integrity of AI-generated content.