A Visual Semantic Adaptive Watermark grounded by Prefix-Tuning for Large Vision-Language Model
Qi Zheng 1,2, Shuliang Liu 1,2, Yu Huang 1,2, Sihang Jia 1,2, Jungang Li 1,2, Lyuhao Chen 3, Junhao Chen 1,2, Hanqian Li 1,2, Aiwei Liu 1,2, Yibo Yan 1,2, Xuming Hu 1,2
1 The Hong Kong University of Science and Technology (Guangzhou)
Published on arXiv
2601.07291
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
VISA-Mark achieves 96.88% AUC detection accuracy and 99.3% attack resilience while improving visual consistency by 7.8% (Chair-I) over vision-agnostic watermarking baselines.
VISA-Mark
Novel technique introduced
Watermarking has emerged as a pivotal solution for content traceability and intellectual property protection in Large Vision-Language Models (LVLMs). However, vision-agnostic watermarks introduce visually irrelevant tokens and disrupt visual grounding by enforcing indiscriminate pseudo-random biases, while some semantic-aware methods incur prohibitive inference latency due to rejection sampling. In this paper, we propose the VIsual Semantic Adaptive Watermark (VISA-Mark), a novel framework that embeds detectable signals while strictly preserving visual fidelity. Our approach employs a lightweight, efficiently trained prefix-tuner to extract dynamic Visual-Evidence Weights, which quantify the evidentiary support for candidate tokens based on the visual input. These weights guide an adaptive vocabulary partitioning and logits perturbation mechanism, concentrating watermark strength specifically on visually-supported tokens. By actively aligning the watermark with visual evidence, VISA-Mark effectively maintains visual fidelity. Empirical results confirm that VISA-Mark outperforms conventional methods with a 7.8% improvement in visual consistency (Chair-I) and superior semantic fidelity. The framework maintains highly competitive detection accuracy (96.88% AUC) and robust attack resilience (99.3%) without sacrificing inference efficiency, effectively establishing a new standard for reliability-preserving multimodal watermarking.
Key Contributions
- VISA-Mark framework that adaptively concentrates watermark strength on visually-supported tokens using dynamic Visual-Evidence Weights, preserving visual fidelity of LVLM outputs
- Lightweight prefix-tuner that extracts token-level visual evidence weights to guide adaptive vocabulary partitioning and logits perturbation without rejection sampling overhead
- Demonstrated a 7.8% improvement in visual consistency (Chair-I) over vision-agnostic watermarking baselines, alongside 96.88% AUC detection accuracy and 99.3% attack resilience
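The adaptive mechanism above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the hash-seeded green/red vocabulary partition follows the standard logit-bias watermarking scheme, and the `evidence_weights` array, `gamma`, and bias `delta` are assumed names and values used only to show how per-token visual-evidence weights could scale the watermark bias.

```python
import hashlib
import random

def green_list(prev_token_id: int, vocab_size: int, gamma: float = 0.5) -> set:
    # Pseudo-randomly partition the vocabulary, seeded by the previous token
    # (standard green/red-list watermarking; the paper's exact partition rule
    # may differ).
    seed = int(hashlib.sha256(str(prev_token_id).encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])

def perturb_logits(logits, prev_token_id, evidence_weights, delta=2.0):
    # Scale the watermark bias per green-list token by its visual-evidence
    # weight, concentrating the signal on visually supported tokens
    # (illustrative only; evidence_weights is a hypothetical per-token array
    # in [0, 1] produced by the prefix-tuner).
    greens = green_list(prev_token_id, len(logits))
    return [
        logit + delta * evidence_weights[i] if i in greens else logit
        for i, logit in enumerate(logits)
    ]
```

In this sketch, a token with low visual-evidence weight receives almost no bias, so the watermark never pushes generation toward visually unsupported tokens.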
🛡️ Threat Analysis
VISA-Mark watermarks the TEXT OUTPUTS of LVLMs to enable content traceability and provenance tracking — the watermark is embedded in generated outputs (via logits perturbation at inference time), not in model weights. This is output integrity / AI-generated content attribution, not model IP protection (ML05).
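Detection of logit-bias watermarks of this family is typically a one-proportion z-test on the count of green-list tokens in a candidate text; the summary above does not give VISA-Mark's detector details, so the following is only the standard test, shown for context.

```python
import math

def z_score(green_count: int, total: int, gamma: float = 0.5) -> float:
    # Under the null hypothesis (unwatermarked text), each token falls in the
    # green list independently with probability gamma; a large z indicates a
    # watermark. gamma is the assumed green-list fraction.
    return (green_count - gamma * total) / math.sqrt(total * gamma * (1 - gamma))
```

For example, 80 green tokens out of 100 at gamma = 0.5 yields z = 6.0, far beyond typical detection thresholds.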