defense 2026

VOW: Verifiable and Oblivious Watermark Detection for Large Language Models

Xiaokun Luan, Yihao Zhang, Pengcheng Su, Feiran Lei, Meng Sun

0 citations

Published on arXiv

2604.27666

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Achieves privacy-preserving, verifiable watermark detection that remains practical for short texts, and reassesses watermark robustness against paraphrasing attacks

VOW

Novel technique introduced


Large Language Model (LLM) watermarking is crucial for establishing the provenance of machine-generated text, but most existing methods rely on a centralized trust model. This model forces users to reveal potentially sensitive text to a provider for detection and offers no way to verify the integrity of the result. While asymmetric schemes have been proposed to address these issues, they are either impractical for short texts or lack formal guarantees linking watermark insertion and detection. We propose VOW, a new protocol that achieves both privacy-preserving and cryptographically verifiable watermark detection with high efficiency. Our approach formulates detection as a secure two-party computation problem, instantiating the watermark's core logic with a Verifiable Oblivious Pseudorandom Function (VOPRF). This allows the user and provider to perform detection without the user's text being revealed, while the provider's result is verifiable. Our comprehensive evaluation shows that VOW is practical for short texts and provides a crucial reassessment of watermark robustness against modern paraphrasing attacks.
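
The blind-evaluate-verify-unblind flow that a VOPRF provides can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a textbook Diffie-Hellman-style OPRF with a Chaum-Pedersen equality proof, run over a toy safe-prime group (P = 2039, Q = 1019) that is far too small for real use. All names (`hash_to_group`, `blind`, `evaluate`, `verify`, `unblind`) are hypothetical, and how VOW maps the PRF output to a watermark score is not described in this summary, so it is left out.

```python
import hashlib
import secrets

# Toy safe-prime group (P = 2Q + 1). Assumption for illustration only:
# real deployments use standardized elliptic-curve groups.
P = 2039
Q = 1019
G = 4  # generates the order-Q subgroup of squares mod P


def hash_to_group(data: bytes) -> int:
    """Map bytes into the order-Q subgroup by squaring a hash-derived value."""
    h = int.from_bytes(hashlib.sha256(data).digest(), "big") % (P - 1) + 1
    return pow(h, 2, P)


def challenge(*values: int) -> int:
    """Fiat-Shamir challenge: hash the transcript to an exponent mod Q."""
    transcript = "|".join(str(v) for v in values).encode()
    return int.from_bytes(hashlib.sha256(transcript).digest(), "big") % Q


def keygen():
    """Provider: secret key k and public key G^k."""
    k = secrets.randbelow(Q - 1) + 1
    return k, pow(G, k, P)


def blind(data: bytes):
    """User: hide the input behind a random exponent r."""
    r = secrets.randbelow(Q - 1) + 1
    return r, pow(hash_to_group(data), r, P)


def evaluate(k: int, blinded: int):
    """Provider: apply k to the blinded element and prove the same k
    underlies the public key (Chaum-Pedersen discrete-log equality proof)."""
    evaluated = pow(blinded, k, P)
    t = secrets.randbelow(Q - 1) + 1
    a, b = pow(G, t, P), pow(blinded, t, P)
    c = challenge(G, pow(G, k, P), blinded, evaluated, a, b)
    s = (t - c * k) % Q
    return evaluated, (c, s)


def verify(pk: int, blinded: int, evaluated: int, proof) -> bool:
    """User: recompute the commitments and check the challenge matches."""
    c, s = proof
    a = pow(G, s, P) * pow(pk, c, P) % P
    b = pow(blinded, s, P) * pow(evaluated, c, P) % P
    return c == challenge(G, pk, blinded, evaluated, a, b)


def unblind(r: int, evaluated: int) -> int:
    """User: strip the blinding exponent to recover H(data)^k."""
    return pow(evaluated, pow(r, -1, Q), P)


if __name__ == "__main__":
    k, pk = keygen()
    r, blinded = blind(b"a short text fragment")
    evaluated, proof = evaluate(k, blinded)
    print("proof ok:", verify(pk, blinded, evaluated, proof))
    print("PRF output:", unblind(r, evaluated))
```

In this sketch the provider only ever sees the blinded group element, never the text, and the user can check the proof against the provider's public key before trusting the result. The modular inverse via `pow(r, -1, Q)` requires Python 3.8 or later.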


Key Contributions

  • VOPRF-based protocol enabling privacy-preserving watermark detection where users don't reveal text to providers
  • Cryptographically verifiable detection results with formal guarantees linking insertion and detection
  • Practical implementation for short texts with evaluation against modern paraphrasing attacks

🛡️ Threat Analysis

Output Integrity Attack

VOW watermarks LLM-generated text to establish content provenance and detect AI-generated text, which falls under output integrity. The paper addresses both watermark insertion and detection with cryptographic verification.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time
Applications
text provenance, ai-generated content detection, content attribution