PRO: Enabling Precise and Robust Text Watermark for Open-Source LLMs
Jiaqi Xue 1, Yifei Zhao 1, Mansour Al Ghanim 1, Shangqian Gao 2, Ruimin Sun 3, Qian Lou 1, Mengxin Zheng 1
Published on arXiv (2510.23891)
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
PRO substantially outperforms prior watermarking methods in both watermark detectability and robustness against fine-tuning and model merging on mainstream open-source LLMs, including LLaMA-3.2, LLaMA-3, and Phi-2.
PRO
Novel technique introduced
Text watermarking for large language models (LLMs) enables model owners to verify text origin and protect intellectual property. While watermarking methods for closed-source LLMs are relatively mature, extending them to open-source models remains challenging, as developers cannot control the decoding process. Consequently, owners of open-source LLMs lack practical means to verify whether text was generated by their models. A core difficulty lies in embedding watermarks directly into model weights without hurting detectability. A promising idea is to distill watermarks from a closed-source model into an open one, but this suffers from (i) poor detectability due to mismatch between learned and predefined patterns, and (ii) fragility to downstream modifications such as fine-tuning or model merging. To overcome these limitations, we propose PRO, a Precise and Robust text watermarking method for open-source LLMs. PRO jointly trains a watermark policy model with the LLM, producing patterns that are easier for the model to learn and more consistent with detection criteria. A regularization term further simulates downstream perturbations and penalizes degradation in watermark detectability, ensuring robustness under model edits. Experiments on open-source LLMs (e.g., LLaMA-3.2, LLaMA-3, Phi-2) show that PRO substantially improves both watermark detectability and resilience to model modifications.
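The joint objective described above can be sketched in minimal form: a language-modeling term, a watermark-detectability term, and a regularizer that evaluates detectability on a perturbed copy of the weights to simulate downstream edits. This is an illustrative sketch under stated assumptions, not PRO's actual implementation — `perturb`, the Gaussian-noise edit model, and the weights `lam`/`mu` are all hypothetical stand-ins.

```python
import random


def perturb(weights, sigma=0.01, rng=None):
    # Simulate a downstream modification (e.g., fine-tuning drift or
    # model merging) by adding small Gaussian noise to each weight.
    # This noise model is an assumption for illustration, not the
    # paper's exact perturbation scheme.
    rng = rng or random.Random(0)
    return [w + rng.gauss(0, sigma) for w in weights]


def total_loss(weights, lm_loss_fn, wm_loss_fn, lam=1.0, mu=1.0):
    # Illustrative joint objective:
    #   l_lm  - language-modeling quality (keep generation fluent)
    #   l_wm  - watermark detectability on the current weights
    #   l_rob - detectability after a simulated edit; penalizing this
    #           term encourages the watermark to survive modifications.
    l_lm = lm_loss_fn(weights)
    l_wm = wm_loss_fn(weights)
    l_rob = wm_loss_fn(perturb(weights))
    return l_lm + lam * l_wm + mu * l_rob
```

In a real training loop the two loss functions would be differentiable losses over model outputs and the perturbation would be applied to actual parameter tensors; the structure of the sum is the point here.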
Key Contributions
- Jointly trained watermark policy model co-optimized with the LLM to produce patterns that are easier to learn and more consistent with detection criteria
- Regularization term that simulates downstream perturbations (fine-tuning, model merging) to penalize watermark degradation and ensure resilience after model modifications
- Demonstrated significant improvements in detectability and robustness over prior methods on LLaMA-3.2, LLaMA-3, and Phi-2
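To make "detectability" concrete: text watermarks of this family are typically verified with a green-list z-test, where each token's predecessor seeds a pseudo-random "green" subset of the vocabulary and watermarked text over-samples green tokens. The sketch below shows that standard statistic; the hashing scheme and parameter names are illustrative assumptions, not necessarily PRO's exact detection criterion.

```python
import hashlib
import random


def green_list(prev_token: int, vocab_size: int, gamma: float = 0.5) -> set:
    # Hypothetical keying scheme: hash the previous token to seed a
    # pseudo-random shuffle of the vocabulary; the first gamma fraction
    # of token ids is the "green" list for this position.
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16)
    ids = list(range(vocab_size))
    random.Random(seed).shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])


def z_score(tokens: list, vocab_size: int, gamma: float = 0.5) -> float:
    # Count bigram transitions whose current token lands in the green
    # list keyed by its predecessor, then compute a one-sided z-test
    # against the null hypothesis that hits occur at base rate gamma.
    n = len(tokens) - 1
    hits = sum(
        cur in green_list(prev, vocab_size, gamma)
        for prev, cur in zip(tokens, tokens[1:])
    )
    return (hits - gamma * n) / (gamma * (1 - gamma) * n) ** 0.5
```

A high z-score (e.g., above 4) indicates the text is very unlikely to be unwatermarked; PRO's contribution is making model-generated text score highly on its detection criterion even after the weights are fine-tuned or merged.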
🛡️ Threat Analysis
PRO watermarks LLM text outputs for content provenance, i.e., verifying whether a given text was generated by a specific model. Although the mechanism modifies model weights (to sidestep the lack of decoding control in open-source settings), the watermark signal is carried in the generated text, making this an output-integrity/content-provenance scheme. Robustness to fine-tuning and model merging therefore means maintaining detectability of the content watermark, not proving model ownership.