Defense · 2025

Yet Another Watermark for Large Language Models

Siyuan Bao 1, Ying Shi 1, Zhiguang Yang 1, Hanzhou Wu 1,2, Xinpeng Zhang 1

0 citations · International Conference on Co...


Published on arXiv

2509.12574

Model Theft

OWASP ML Top 10 — ML05

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Parameter-level LLM watermarking via output-layer weight manipulation preserves semantic text quality better than token-sampling methods while enabling black-box watermark detection from generated text.


Existing watermarking methods for large language models (LLMs) mainly embed the watermark by adjusting token-sampling predictions or by post-processing, lacking intrinsic coupling with the LLM, which may significantly reduce the semantic quality of the generated marked texts. Traditional watermarking methods based on training or fine-tuning may be extendable to LLMs; however, most are limited to the white-box scenario or are very time-consuming due to the massive number of LLM parameters. In this paper, we present a new watermarking framework for LLMs in which the watermark is embedded into the LLM by manipulating its internal parameters and can be extracted from the generated text without accessing the LLM. Compared with related methods, the proposed method entangles the watermark with the intrinsic parameters of the LLM, which better balances the robustness and imperceptibility of the watermark. Moreover, the proposed method enables watermark extraction under the black-box scenario, which is computationally efficient in practice. Experimental results have verified the feasibility, superiority, and practicality of the proposed method. This work provides a new perspective that differs from mainstream works and may shed light on future research.


Key Contributions

  • Novel LLM watermarking mechanism that embeds watermarks via structured, sparse manipulation of output-layer weight parameters rather than at generation/sampling time
  • Black-box watermark extraction from generated text without requiring access to the watermarked model, enabling computationally efficient verification
  • Demonstrates improved balance between robustness and imperceptibility/text quality compared to token-sampling-based watermarking approaches
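The paper does not spell out its embedding rule in this summary, so the following is a hedged sketch of the general idea only: bias a key-derived subset of output-layer (LM head) weight rows so that keyed tokens are slightly favored at generation time, then verify in a black-box fashion with a statistical test on the token frequencies of generated text. The function names, the multiplicative perturbation, and the z-score detector are illustrative assumptions (KGW-style detection applied at the weight level), not the authors' actual construction.

```python
import hashlib
import numpy as np


def keyed_green_ids(key: str, vocab_size: int, gamma: float = 0.5) -> np.ndarray:
    """Derive a reproducible 'green' vocabulary subset from a secret key."""
    seed = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    return rng.choice(vocab_size, size=int(gamma * vocab_size), replace=False)


def embed_watermark(lm_head_weight: np.ndarray, key: str, delta: float = 0.01) -> np.ndarray:
    """Hypothetical embedding step: sparsely scale the output-layer rows of
    keyed tokens so their logits are nudged upward at generation time.
    lm_head_weight has shape (vocab_size, hidden_dim)."""
    marked = lm_head_weight.copy()
    green = keyed_green_ids(key, marked.shape[0])
    marked[green] *= 1.0 + delta  # small multiplicative perturbation per green row
    return marked


def detect_watermark(token_ids, key: str, vocab_size: int, gamma: float = 0.5) -> float:
    """Black-box detection: z-score of the green-token frequency in generated
    text. No model access is needed, only the token ids and the secret key."""
    green = set(keyed_green_ids(key, vocab_size, gamma).tolist())
    n = len(token_ids)
    hits = sum(1 for t in token_ids if t in green)
    return (hits - gamma * n) / np.sqrt(gamma * (1.0 - gamma) * n)
```

A high z-score (e.g. above 4) indicates the text likely came from the marked model; an unmarked model should produce green tokens at roughly the base rate `gamma`, giving a z-score near zero.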

🛡️ Threat Analysis

Model Theft

The watermark is embedded directly into LLM model parameters (output-layer weight manipulation), and the paper's keywords explicitly state 'intellectual property protection' — this is model IP protection via model-weight watermarking, the canonical ML05 use case.

Output Integrity Attack

The watermark manifests in and is verified from LLM-generated text outputs without requiring model access (black-box detection), making it a content provenance scheme for tracing which LLM produced a given text — a direct ML09 content watermarking contribution comparable to KGW-style token-sampling watermarks.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box
Applications
large language model IP protection, AI-generated text provenance, LLM content traceability