WMVLM: Evaluating Diffusion Model Image Watermarking via Vision-Language Models
Zijin Yang, Yu Sun, Kejiang Chen, Jiawei Zhao, Jun Jiang, Weiming Zhang, Nenghai Yu
Published on arXiv
arXiv:2601.21610
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
WMVLM outperforms state-of-the-art VLMs on watermark evaluation with strong generalization across datasets, diffusion models, and watermarking methods
WMVLM
Novel technique introduced
Digital watermarking is essential for securing images generated by diffusion models. Accurate watermark evaluation is critical for algorithm development, yet existing methods have significant limitations: they lack a unified framework for both residual and semantic watermarks, provide results without interpretability, neglect comprehensive security considerations, and often apply metrics ill-suited to semantic watermarks. To address these gaps, we propose WMVLM, the first unified and interpretable evaluation framework for diffusion model image watermarking via vision-language models (VLMs). We redefine quality and security metrics for each watermark type: residual watermarks are evaluated by artifact strength and erasure resistance, while semantic watermarks are assessed through latent distribution shifts. Moreover, we introduce a three-stage training strategy that progressively equips the model for classification, scoring, and interpretable text generation. Experiments show that WMVLM outperforms state-of-the-art VLMs, with strong generalization across datasets, diffusion models, and watermarking methods.
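The "latent distribution shift" idea for semantic watermarks can be sketched as a distance between the distribution of (inverted) latents and the diffusion model's standard Gaussian prior. The diagonal-Gaussian assumption and the Fréchet-style score below are illustrative assumptions, not the paper's exact metric:

```python
import random
import statistics

def gaussian_shift_score(latents, ref_mean=0.0, ref_std=1.0):
    """1-D Frechet (2-Wasserstein) distance between the empirical latent
    distribution, treated as a Gaussian, and the reference prior
    N(ref_mean, ref_std^2). A hypothetical proxy for a latent-shift metric."""
    mu = statistics.fmean(latents)
    sigma = statistics.pstdev(latents)
    return (mu - ref_mean) ** 2 + (sigma - ref_std) ** 2

rng = random.Random(0)
# Unwatermarked latents follow the model's N(0, 1) prior.
clean = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
# A semantic watermark perturbs the latent prior (toy shift in mean and scale).
marked = [rng.gauss(0.3, 1.1) for _ in range(10_000)]

print(f"clean shift:  {gaussian_shift_score(clean):.4f}")
print(f"marked shift: {gaussian_shift_score(marked):.4f}")
```

A larger score indicates a stronger departure from the prior, which is the signal a semantic-watermark evaluator would score.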
Key Contributions
- First unified evaluation framework (WMVLM) covering both residual and semantic watermark types for diffusion model outputs, providing interpretable text generation alongside classification and scoring
- Redefined quality and security metrics per watermark type: artifact strength and erasure resistance for residual watermarks; latent distribution shifts for semantic watermarks
- Three-stage progressive training strategy enabling a VLM to perform classification, scoring, and natural-language interpretation of watermark quality and security
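The "erasure resistance" metric for residual watermarks can be illustrated with a toy sign-embedding watermark subjected to noise attacks of increasing strength; the embedding scheme, attack, and payload size below are illustrative assumptions, not the paper's method:

```python
import random

def embed(signal, bits, strength=0.2):
    """Toy residual watermark: nudge each coefficient up or down per bit."""
    return [s + (strength if b else -strength) for s, b in zip(signal, bits)]

def extract(attacked, signal):
    """Recover bits from the sign of the residual against the original."""
    return [1 if a - s > 0 else 0 for a, s in zip(attacked, signal)]

def bit_accuracy(bits, recovered):
    return sum(b == r for b, r in zip(bits, recovered)) / len(bits)

rng = random.Random(42)
signal = [rng.random() for _ in range(1_000)]     # stand-in for image coefficients
bits = [rng.randint(0, 1) for _ in range(1_000)]  # watermark payload
marked = embed(signal, bits)

# Erasure attack: additive Gaussian noise of increasing strength.
accs = {}
for noise in (0.05, 0.2, 0.5):
    attacked = [m + rng.gauss(0, noise) for m in marked]
    accs[noise] = bit_accuracy(bits, extract(attacked, signal))
    print(f"noise sigma {noise}: bit accuracy {accs[noise]:.2f}")
```

Erasure resistance is then the bit accuracy retained as attack strength grows; a robust scheme degrades slowly, while a fragile one collapses toward the 0.5 chance level.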
🛡️ Threat Analysis
The paper's primary contribution is a framework for evaluating watermarking schemes applied to diffusion model image outputs (content provenance/authenticity). It defines quality and security metrics for both residual and semantic watermarks, including erasure resistance, directly targeting the output integrity of AI-generated content.