
How Good is Post-Hoc Watermarking With Language Model Rephrasing?

Pierre Fernandez, Tom Sander, Hady Elsahar, Hongyan Chang, Tomáš Souček, Valeriu Lacatusu, Tuan Tran, Sylvestre-Alvise Rebuffi, Alexandre Mourachko

0 citations · 55 references · arXiv


Published on arXiv · 2512.16904

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

The Gumbel-max scheme with beam search achieves strong detectability and semantic fidelity on open-ended text, but all methods degrade significantly on code, where smaller models counterintuitively outperform larger ones.

post-hoc LLM watermarking with Gumbel-max

Novel technique introduced


Generation-time text watermarking embeds statistical signals into text for traceability of AI-generated content. We explore *post-hoc watermarking*, where an LLM rewrites existing text while applying generation-time watermarking, to protect copyrighted documents or to detect their use in training or RAG via watermark radioactivity. Unlike generation-time approaches, which are constrained by how LLMs are served, this setting offers additional degrees of freedom for both generation and detection. We investigate how allocating compute (through larger rephrasing models, beam search, multi-candidate generation, or entropy filtering at detection) affects the quality-detectability trade-off. Our strategies achieve strong detectability and semantic fidelity on open-ended text such as books. Among our findings, the simple Gumbel-max scheme surprisingly outperforms more recent alternatives under nucleus sampling, and most methods benefit significantly from beam search. However, most approaches struggle when watermarking verifiable text such as code, where we counterintuitively find that smaller models outperform larger ones. This study reveals both the potential and limitations of post-hoc watermarking, laying groundwork for practical applications and future research.
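The Gumbel-max scheme at the heart of the study can be sketched in a few lines. Each next token is chosen as argmax u_i^(1/p_i), where the per-vocabulary uniforms u are derived pseudorandomly from a secret key and the recent context; over random keys the chosen token is distributed exactly as the model's probabilities p, yet a detector holding the key can recompute the same uniforms and test whether the chosen tokens' u-values are suspiciously large. The following is a minimal illustrative sketch, not the paper's implementation: the toy vocabulary size, secret key, hash-based seeding, and function names are all assumptions.

```python
import hashlib
import math

import numpy as np

VOCAB = 64                      # toy vocabulary size (assumption)
KEY = b"secret-watermark-key"   # detector's secret key (assumption)

def seeded_uniforms(key: bytes, context: tuple, vocab: int) -> np.ndarray:
    """Derive one pseudorandom uniform per vocabulary item from the
    secret key and the last few context tokens."""
    digest = hashlib.sha256(key + str(context).encode()).digest()
    rng = np.random.default_rng(int.from_bytes(digest[:8], "big"))
    return rng.random(vocab)

def gumbel_max_token(probs: np.ndarray, key: bytes, context: tuple) -> int:
    """Gumbel-max watermark choice: argmax u_i^(1/p_i). Marginally over
    random keys the choice is distributed exactly as probs, but it is
    deterministic given (key, context), which the detector exploits."""
    u = seeded_uniforms(key, context, len(probs))
    return int(np.argmax(u ** (1.0 / np.maximum(probs, 1e-12))))

def detection_score(tokens: list, key: bytes, h: int = 3) -> float:
    """Sum of -log(1 - u_t[x_t]). Without the watermark each term is
    Exp(1) with mean 1; watermarked tokens have systematically large
    u-values, so the sum is far above the token count."""
    s = 0.0
    for t in range(h, len(tokens)):
        u = seeded_uniforms(key, tuple(tokens[t - h:t]), VOCAB)
        s += -math.log(max(1.0 - u[tokens[t]], 1e-12))
    return s
```

Because the null distribution of the score is a sum of Exp(1) variables, detection reduces to a simple hypothesis test: unwatermarked text scores near the number of tokens, while text sampled with `gumbel_max_token` scores far higher.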


Key Contributions

  • Systematic empirical study of post-hoc watermarking strategies (beam search, multi-candidate generation, entropy filtering) and their effect on the quality-detectability trade-off
  • Finding that the Gumbel-max watermarking scheme outperforms more recent alternatives under nucleus sampling, and that beam search substantially benefits most methods
  • Identification of a counterintuitive failure mode: most watermarking approaches struggle on verifiable text (code), where smaller rephrasing models outperform larger ones
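One of the detection-side levers listed above, entropy filtering, can be sketched in isolation: the detector scores only positions where the rephrasing model's next-token entropy is high, since nearly forced tokens (entropy close to zero, as is common in rigid formats like code) carry almost no watermark signal and mostly add noise. The threshold value, helper names, and hash-based seeding below are illustrative assumptions, not the paper's exact procedure.

```python
import hashlib
import math

import numpy as np

VOCAB = 64  # toy vocabulary size (assumption)

def seeded_uniforms(key: bytes, context: tuple, vocab: int) -> np.ndarray:
    """Recompute the pseudorandom uniforms the watermarked sampler used."""
    digest = hashlib.sha256(key + str(context).encode()).digest()
    return np.random.default_rng(int.from_bytes(digest[:8], "big")).random(vocab)

def entropy_filtered_score(tokens: list, entropies: list, key: bytes,
                           tau: float = 2.0, h: int = 3) -> tuple:
    """Score -log(1 - u_t[x_t]) only at positions whose next-token
    entropy (in nats) is at least tau. Returns (score, scored_count)."""
    s, n = 0.0, 0
    for t in range(h, len(tokens)):
        if entropies[t] < tau:
            continue  # low-entropy position: nearly forced, signal-free
        u = seeded_uniforms(key, tuple(tokens[t - h:t]), VOCAB)
        s += -math.log(max(1.0 - u[tokens[t]], 1e-12))
        n += 1
    return s, n
```

Returning the count of scored positions matters because the null distribution of the score depends on it: the detector compares the score against a sum of `n` Exp(1) variables rather than one per token.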

🛡️ Threat Analysis

Output Integrity Attack

Directly addresses content watermarking of LLM text outputs to embed detectable provenance signals, enabling traceability of AI-generated content, copyright protection, and detection of watermarked content in training/RAG pipelines — canonical output integrity and content provenance work.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time
Datasets
books corpus, code datasets
Applications
text provenance, ai-generated content traceability, copyright protection, rag content detection