HMARK: Radioactive Multi-Bit Semantic-Latent Watermarking for Diffusion Models
Kexin Li, Guozhen Ding, Ilya Grishchenko, David Lie
Published on arXiv (arXiv:2512.00094)
Output Integrity Attack
OWASP ML Top 10: ML09
Key Finding
Achieves 100% recall and 1.0 AUC on outputs of a LoRA-fine-tuned diffusion model trained on watermarked images, with 98.57% detection accuracy and 95.07% bit recovery under various distortions.
HMARK
Novel technique introduced
Modern generative diffusion models rely on vast training datasets, often including images with uncertain ownership or usage rights. Radioactive watermarks, i.e., marks that transfer to a model's outputs, can help detect when such unauthorized data has been used for training. Beyond radioactivity, an effective watermark for protecting images from unauthorized training must also satisfy existing requirements such as imperceptibility, robustness, and multi-bit capacity. To meet these requirements, we propose HMARK, a novel multi-bit watermarking scheme that encodes ownership information as secret bits in the semantic-latent space (h-space) of image diffusion models. By leveraging the interpretability and semantic significance of the h-space, HMARK ensures that watermark signals correspond to meaningful semantic attributes; as a result, its embedded watermarks exhibit radioactivity, robustness to distortions, and minimal impact on perceptual quality. Experimental results demonstrate that HMARK achieves 98.57% watermark detection accuracy, 95.07% bit-level recovery accuracy, a 100% recall rate, and 1.0 AUC on images produced by a downstream adversarial model fine-tuned with LoRA on watermarked data, across various types of distortions.
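The abstract does not spell out the embedding mechanics, so the following is only an illustrative sketch of one generic way a multi-bit latent watermark can work: each secret bit overwrites the latent's projection onto one of a set of secret orthonormal semantic directions with a signed magnitude, and decoding reads back the sign of each projection. The function names, the overwrite-projection scheme, and the choice of ±alpha targets are all assumptions for illustration, not HMARK's actual algorithm.

```python
def project(h, d):
    """Dot product of latent vector h with direction d."""
    return sum(a * b for a, b in zip(h, d))

def embed_bits(h, dirs, bits, alpha=0.1):
    """Write each bit into h by overwriting its projection onto one
    orthonormal direction with +alpha (bit 1) or -alpha (bit 0)."""
    out = list(h)
    for d, b in zip(dirs, bits):
        target = alpha if b == 1 else -alpha
        shift = target - project(out, d)
        # shifting along d leaves projections onto the other
        # (orthogonal) directions unchanged
        out = [x + shift * di for x, di in zip(out, d)]
    return out

def decode_bits(h_marked, dirs):
    """Blind decoding: read the sign of each secret-direction projection."""
    return [1 if project(h_marked, d) > 0 else 0 for d in dirs]

# Toy example: 4-dim latent, two standard-basis "semantic" directions.
dirs = [[1, 0, 0, 0], [0, 1, 0, 0]]
h = [0.3, -0.7, 0.2, 0.5]
marked = embed_bits(h, dirs, [0, 1])
assert decode_bits(marked, dirs) == [0, 1]
```

Decoding needs only the secret directions, not the original latent; the premise of a radioactive design is that such semantically aligned components survive generation and downstream fine-tuning.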
Key Contributions
- Novel multi-bit watermarking scheme (HMARK) that embeds ownership information in the semantic-latent h-space of diffusion models, making watermarks both imperceptible and semantically meaningful
- Radioactive watermark design that survives LoRA fine-tuning, enabling detection of unauthorized training data use in downstream adversarially fine-tuned models
- Comprehensive evaluation demonstrating 98.57% detection accuracy, 95.07% bit-level recovery accuracy, 100% recall, and 1.0 AUC across multiple distortion types
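The reported metrics (bit-level recovery accuracy, detection accuracy, AUC) have standard definitions that can be made concrete. The sketch below is a generic illustration of how such numbers are computed, with the match-fraction detection score and all function names being assumptions rather than the paper's evaluation code.

```python
def bit_recovery_accuracy(embedded, decoded):
    """Fraction of watermark bits recovered correctly (bit-level accuracy)."""
    assert len(embedded) == len(decoded)
    return sum(e == d for e, d in zip(embedded, decoded)) / len(embedded)

def auc(watermarked_scores, clean_scores):
    """AUC as the probability that a watermarked image scores higher than
    a clean one (ties count half); 1.0 means perfect separation."""
    pairs = [(w, c) for w in watermarked_scores for c in clean_scores]
    wins = sum((w > c) + 0.5 * (w == c) for w, c in pairs)
    return wins / len(pairs)

# Detection: flag an image as watermarked when the fraction of matched
# bits exceeds a threshold; an unwatermarked image matches ~50% of the
# bits by chance, so the two score populations separate.
key = [1, 0, 1, 1, 0, 1, 0, 0]
decoded = [1, 0, 1, 1, 0, 1, 1, 0]        # one bit flipped by distortion
score = bit_recovery_accuracy(key, decoded)   # 0.875
detected = score > 0.75                       # True
```

A recall of 100% with AUC 1.0, as reported, means every watermarked output scored above every clean one under the chosen detection score.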
🛡️ Threat Analysis
Watermarks are embedded in the training images themselves (content provenance) and are 'radioactive': they transfer to model outputs when unauthorized training occurs. The guidelines explicitly place watermarking training data to detect misappropriation under ML09 rather than ML05, because the watermark resides in the data/content, not in the model weights.