Defense · 2025

MarkTune: Improving the Quality-Detectability Trade-off in Open-Weight LLM Watermarking

Yizhou Zhao 1, Zhiwei Steven Wu 2, Adam Block 3

0 citations · 50 references · arXiv

Published on arXiv · 2512.04044

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

MarkTune pushes the quality-detectability frontier of GaussMark close to that of inference-time watermarking methods (KGW, Gumbel-max, SynthID) while remaining robust to paraphrasing and fine-tuning removal attacks on both Qwen3-4B and Llama2-7B.

MarkTune

Novel technique introduced


Watermarking aims to embed hidden signals in generated text that can be reliably detected given access to a secret key. Open-weight language models pose acute challenges for such watermarking schemes because the inference-time interventions that dominate contemporary approaches cannot be enforced once model weights are public. Existing watermarking techniques for open-weight models, such as the recently proposed GaussMark, typically rely on small modifications to model weights, which can yield signals detectable to those equipped with a secret key, but achieving detection power comparable to inference-time watermarks generally requires weight perturbations that noticeably reduce generation quality. We introduce MarkTune, a theoretically principled, on-policy fine-tuning framework that treats the GaussMark signal as a reward while simultaneously regularizing against degradation in text quality. We derive MarkTune as an improvement on GaussMark and demonstrate that MarkTune consistently improves the quality-detectability trade-off over GaussMark by steering finer-grained, watermark-aware weight updates within the model's representation space while preserving generation quality. Empirically, we show that MarkTune pushes the quality-detectability frontier of GaussMark close to that of inference-time watermarking, remains robust to paraphrasing and fine-tuning attacks, and exhibits strong generalization: a model fine-tuned on one dataset retains substantial watermark detection power on unseen datasets. Together, these results establish MarkTune as a general strategy for embedding robust, high-quality watermarks into open-weight LMs.


Key Contributions

  • MarkTune: an on-policy fine-tuning framework that treats the GaussMark watermark signal as a reward while regularizing against text quality degradation, improving the quality-detectability Pareto frontier
  • Theoretical derivation showing MarkTune as a principled improvement over GaussMark via watermark-aware weight updates in the model's representation space
  • Empirical demonstration that MarkTune approaches inference-time watermark performance, remains robust to paraphrasing and fine-tuning attacks, and generalizes across unseen datasets
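The reward-plus-regularizer objective described above can be sketched as a simple surrogate loss: maximize a per-sample watermark score while penalizing divergence from the unwatermarked reference model. This is a minimal illustration under stated assumptions — the function name `marktune_loss`, the per-sample KL estimate, and the exact form of the trade-off are hypothetical, not the paper's formulation.

```python
import numpy as np

def marktune_loss(logp_model, logp_ref, watermark_score, beta=0.1):
    """Sketch of an on-policy, reward-regularized surrogate objective.

    logp_model      -- log-probabilities of sampled text under the fine-tuned model
    logp_ref        -- log-probabilities of the same text under the reference model
    watermark_score -- per-sample watermark (reward) signal, higher = more detectable
    beta            -- strength of the quality (KL) regularizer

    Minimizing this loss increases the watermark reward while keeping the
    fine-tuned model close to the reference, preserving generation quality.
    """
    kl_penalty = np.asarray(logp_model) - np.asarray(logp_ref)  # per-sample KL estimate
    return float(np.mean(-np.asarray(watermark_score) + beta * kl_penalty))
```

In an actual fine-tuning loop this scalar would be backpropagated through the model's log-probabilities; here it only illustrates how the reward and the quality regularizer pull in opposite directions.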

🛡️ Threat Analysis

Output Integrity Attack

MarkTune watermarks LLM-generated TEXT for provenance and authenticity — detection is performed on generated text samples to verify they came from the watermarked model. Although the signal is injected via weight modifications rather than inference-time sampling, the goal is content provenance (academic integrity, misinformation mitigation), not model IP protection. This matches ML09 content watermarking, analogous to KGW/SynthID which the paper directly benchmarks against.
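Text-side detection of this kind typically reduces to a one-sided significance test on per-token scores that are mean-zero for unwatermarked text. The sketch below assumes such scores are already computed (in GaussMark they come from a keyed statistic on the model's weights; that score function is abstracted away here), and the helper name `watermark_z_score` is illustrative.

```python
import numpy as np

def watermark_z_score(per_token_scores):
    """Hedged sketch of watermark detection on a text sample.

    Under the null hypothesis (text not generated by the watermarked
    model) the per-token scores have zero mean, so a large positive
    z-statistic indicates the watermark is present. The keyed score
    function itself is out of scope for this sketch.
    """
    s = np.asarray(per_token_scores, dtype=float)
    n = len(s)
    # z = sqrt(n) * mean / std; a small epsilon guards against zero variance
    return float(s.sum() / (np.sqrt(n) * (s.std(ddof=1) + 1e-12)))
```

A detector would compare this statistic against a threshold chosen for a target false-positive rate (e.g. z > 4 for a very low p-value under the normal approximation).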


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, white_box
Datasets
Qwen3-4B evaluations, Llama2-7B evaluations
Applications
open-weight llm watermarking, ai-generated text provenance, academic integrity detection, misinformation mitigation