Defense · 2025

Yet Another Watermark for Large Language Models

Siyuan Bao 1, Ying Shi 1, Zhiguang Yang 1, Hanzhou Wu 1,2, Xinpeng Zhang 1

0 citations · International Conference on Co...


Published on arXiv

2509.12574

Model Theft

OWASP ML Top 10 — ML05

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Parameter-level LLM watermarking via output-layer weight manipulation preserves semantic text quality better than token-sampling methods while enabling black-box watermark detection from generated text.


Existing watermarking methods for large language models (LLMs) mainly embed the watermark by adjusting token-sampling predictions or by post-processing, lacking intrinsic coupling with the LLM, which may significantly reduce the semantic quality of the generated marked texts. Traditional watermarking methods based on training or fine-tuning may be extendable to LLMs; however, most are limited to the white-box scenario or are very time-consuming due to the massive number of LLM parameters. In this paper, we present a new watermarking framework for LLMs in which the watermark is embedded into the LLM by manipulating its internal parameters and can be extracted from the generated text without accessing the LLM. Compared with related methods, the proposed method entangles the watermark with the intrinsic parameters of the LLM, which better balances the robustness and imperceptibility of the watermark. Moreover, the proposed method enables watermark extraction under the black-box scenario, which is computationally efficient in practice. Experimental results have verified the feasibility, superiority, and practicality of the proposed method. This work provides a new perspective that differs from mainstream works and may shed light on future research.


Key Contributions

  • Novel LLM watermarking mechanism that embeds watermarks via structured, sparse manipulation of output-layer weight parameters rather than at generation/sampling time
  • Black-box watermark extraction from generated text without requiring access to the watermarked model, enabling computationally efficient verification
  • Demonstrates improved balance between robustness and imperceptibility/text quality compared to token-sampling-based watermarking approaches
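The paper does not spell out its embedding rule in this summary, so the following is a hedged sketch of the general idea only: bias a key-derived subset of output-layer (LM head) weight rows so that keyed tokens are slightly favored at generation time, then verify in a black-box fashion with a statistical test on the token frequencies of generated text. The function names, the multiplicative perturbation, and the z-score detector are illustrative assumptions (KGW-style detection applied at the weight level), not the authors' actual construction.

```python
import hashlib
import numpy as np


def keyed_green_ids(key: str, vocab_size: int, gamma: float = 0.5) -> np.ndarray:
    """Derive a reproducible 'green' vocabulary subset from a secret key."""
    seed = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    return rng.choice(vocab_size, size=int(gamma * vocab_size), replace=False)


def embed_watermark(lm_head_weight: np.ndarray, key: str, delta: float = 0.01) -> np.ndarray:
    """Hypothetical embedding step: sparsely scale the output-layer rows of
    keyed tokens so their logits are nudged upward at generation time.
    lm_head_weight has shape (vocab_size, hidden_dim)."""
    marked = lm_head_weight.copy()
    green = keyed_green_ids(key, marked.shape[0])
    marked[green] *= 1.0 + delta  # small multiplicative perturbation per green row
    return marked


def detect_watermark(token_ids, key: str, vocab_size: int, gamma: float = 0.5) -> float:
    """Black-box detection: z-score of the green-token frequency in generated
    text. No model access is needed, only the token ids and the secret key."""
    green = set(keyed_green_ids(key, vocab_size, gamma).tolist())
    n = len(token_ids)
    hits = sum(1 for t in token_ids if t in green)
    return (hits - gamma * n) / np.sqrt(gamma * (1.0 - gamma) * n)
```

A high z-score (e.g. above 4) indicates the text likely came from the marked model; an unmarked model should produce green tokens at roughly the base rate `gamma`, giving a z-score near zero.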

🛡️ Threat Analysis

Model Theft

The watermark is embedded directly into LLM model parameters (output-layer weight manipulation), and the paper's keywords explicitly state 'intellectual property protection' — this is model IP protection via model-weight watermarking, the canonical ML05 use case.

Output Integrity Attack

The watermark manifests in and is verified from LLM-generated text outputs without requiring model access (black-box detection), making it a content provenance scheme for tracing which LLM produced a given text — a direct ML09 content watermarking contribution comparable to KGW-style token-sampling watermarks.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box
Applications
large language model IP protection, AI-generated text provenance, LLM content traceability