Defense · 2025

Learning to Watermark: A Selective Watermarking Framework for Large Language Models via Multi-Objective Optimization

Chenrui Wang 1, Junyi Shu 1, Billy Chiu 2, Yu Li 3, Saleh Alharbi 4, Min Zhang 1, Jing Li 1

0 citations · 44 references · arXiv


Published on arXiv · 2510.15976

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

LTW significantly enhances generated text quality without compromising watermark detectability when integrated with baseline watermarking methods including KGW
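KGW, the baseline scheme named here, biases generation toward a pseudorandom "green list" of tokens seeded by the previous token, and a selective framework like LTW can gate whether that bias is applied at each step. A simplified sketch of the green-list mechanism (the hash choice and the `gamma`/`delta` values are illustrative, not taken from the paper):

```python
import hashlib
import numpy as np

def green_list(prev_token_id, vocab_size, gamma=0.5):
    # Seed a PRNG from the previous token (KGW uses a keyed hash; key omitted here).
    seed = int(hashlib.sha256(str(prev_token_id).encode()).hexdigest(), 16) % (2**32)
    rng = np.random.default_rng(seed)
    perm = rng.permutation(vocab_size)
    # The first gamma fraction of the permuted vocabulary is "green".
    return set(perm[: int(gamma * vocab_size)].tolist())

def bias_logits(logits, prev_token_id, delta=2.0, apply=True):
    # A selective framework would set `apply` per token; here it is a plain flag.
    if not apply:
        return logits
    out = logits.copy()
    for g in green_list(prev_token_id, len(logits)):
        out[g] += delta  # boost green-list tokens before sampling
    return out

biased = bias_logits(np.zeros(100), prev_token_id=42)
```

Skipping the bias on low-entropy or quality-sensitive tokens is exactly the degree of freedom a learned selector exploits.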

Learning to Watermark (LTW)

Novel technique introduced


The rapid development of LLMs has raised concerns about their potential misuse, motivating various watermarking schemes that typically offer high detectability. However, existing watermarking techniques often face a trade-off between watermark detectability and generated text quality. In this paper, we introduce Learning to Watermark (LTW), a novel selective watermarking framework that leverages multi-objective optimization to balance these competing goals. LTW features a lightweight network that adaptively decides when to apply the watermark by analyzing sentence embeddings, token entropy, and the current watermarking ratio. The network is trained with two specifically constructed loss functions that guide it toward Pareto-optimal solutions, harmonizing watermark detectability and text quality. By integrating LTW with two baseline watermarking methods, our experimental evaluations demonstrate that LTW significantly enhances text quality without compromising detectability. Our selective approach offers a new perspective on watermark design for LLMs and a way to preserve high text quality in watermarked output. The code is publicly available at: https://github.com/fattyray/learning-to-watermark
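The abstract describes a lightweight network that gates watermarking per step from three signals: a sentence embedding, the current token's entropy, and the running watermarking ratio. A minimal NumPy sketch of such a gate (layer sizes, initialization, and names are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

class SelectorNet:
    """Hypothetical two-layer MLP gate: emits the probability of
    applying the watermark at the current decoding step."""

    def __init__(self, embed_dim=8, hidden=16):
        d_in = embed_dim + 2  # sentence embedding + [entropy, watermark ratio]
        self.W1 = rng.normal(0, 0.1, (d_in, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0, 0.1, (hidden, 1))
        self.b2 = np.zeros(1)

    def forward(self, sent_emb, token_entropy, wm_ratio):
        x = np.concatenate([sent_emb, [token_entropy, wm_ratio]])
        h = np.maximum(0.0, x @ self.W1 + self.b1)   # ReLU
        logit = h @ self.W2 + self.b2
        return 1.0 / (1.0 + np.exp(-logit[0]))       # sigmoid probability

net = SelectorNet()
p = net.forward(rng.normal(size=8), token_entropy=2.3, wm_ratio=0.4)
apply_watermark = p > 0.5
```

Feeding in the running watermark ratio lets the gate keep enough watermarked tokens overall while sparing low-entropy positions where biasing would hurt fluency.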


Key Contributions

  • Selector Network — a lightweight MLP that adaptively decides when to apply watermarking using sentence embeddings, token entropy, and current watermarking ratio
  • Multi-objective training with detectability and quality loss functions guiding the Selector Network toward Pareto-optimal watermarking decisions via MGDA
  • Plug-in framework that wraps existing watermarking methods (e.g., KGW) to improve text quality without sacrificing detectability
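The multi-objective training bullet names MGDA, which combines the gradients of the detectability and quality losses via a min-norm weighting. For two objectives this weighting has a closed form (a generic two-task MGDA sketch in the style of Sener & Koltun; the gradients below are toy stand-ins, not the paper's losses):

```python
import numpy as np

def mgda_two_task(g1, g2):
    """Closed-form min-norm solver for two gradients: find alpha in [0, 1]
    minimizing ||alpha * g1 + (1 - alpha) * g2||^2."""
    diff = g1 - g2
    denom = float(diff @ diff)
    if denom == 0.0:
        return 0.5  # gradients coincide; any weighting is equivalent
    alpha = float((g2 - g1) @ g2) / denom
    return min(max(alpha, 0.0), 1.0)

# Toy example: orthogonal detectability / quality gradients.
g_detect = np.array([1.0, 0.0])
g_quality = np.array([0.0, 1.0])
alpha = mgda_two_task(g_detect, g_quality)
combined = alpha * g_detect + (1 - alpha) * g_quality  # descent direction
```

Stepping along the min-norm combination decreases both losses whenever possible, which is how the selector is pushed toward Pareto-optimal watermarking decisions.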

🛡️ Threat Analysis

Output Integrity Attack

Embeds watermarks in LLM-generated text to enable detection of AI-generated content and provenance tracing; the concern is output integrity and content authenticity, not model IP protection.
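On the detection side, green-list watermarks of the kind LTW wraps are typically verified with a one-proportion z-test: count how many tokens fall in their green lists and compare against the chance rate. A minimal sketch (the threshold of ~4 is a common convention, not a value from this paper):

```python
import math

def detection_z_score(green_hits, total_tokens, gamma=0.5):
    """Under no watermark, each token is green with probability gamma;
    a large z-score flags the text as watermarked."""
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1.0 - gamma))
    return (green_hits - expected) / std

# e.g. 140 green tokens out of 200 with gamma = 0.5:
z = detection_z_score(140, 200)
watermarked = z > 4.0  # typical detection threshold
```

Because detection only counts green hits, skipping the watermark on a fraction of tokens (as a selective scheme does) lowers z gradually rather than breaking detection outright.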


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
inference_time
Applications
llm text generation · ai-generated text detection · content provenance