
How Good is Post-Hoc Watermarking With Language Model Rephrasing?

Pierre Fernandez, Tom Sander, Hady Elsahar, Hongyan Chang, Tomáš Souček, Valeriu Lacatusu, Tuan Tran, Sylvestre-Alvise Rebuffi, Alexandre Mourachko

0 citations · 55 references · arXiv


Published on arXiv · 2512.16904

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

The Gumbel-max scheme with beam search achieves strong detectability and semantic fidelity on open-ended text, but all methods degrade significantly on code, where smaller models counterintuitively outperform larger ones.

post-hoc LLM watermarking with Gumbel-max

Novel technique introduced


Generation-time text watermarking embeds statistical signals into text for traceability of AI-generated content. We explore *post-hoc watermarking*, where an LLM rewrites existing text while applying generation-time watermarking, to protect copyrighted documents or to detect their use in training or RAG via watermark radioactivity. Unlike generation-time approaches, which are constrained by how LLMs are served, this setting offers additional degrees of freedom for both generation and detection. We investigate how allocating compute (through larger rephrasing models, beam search, multi-candidate generation, or entropy filtering at detection) affects the quality-detectability trade-off. Our strategies achieve strong detectability and semantic fidelity on open-ended text such as books. Among our findings, the simple Gumbel-max scheme surprisingly outperforms more recent alternatives under nucleus sampling, and most methods benefit significantly from beam search. However, most approaches struggle when watermarking verifiable text such as code, where we counterintuitively find that smaller models outperform larger ones. This study reveals both the potential and limitations of post-hoc watermarking, laying groundwork for practical applications and future research.
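The Gumbel-max scheme at the heart of the study can be sketched in a few lines. Each next token is chosen as argmax u_i^(1/p_i), where the per-vocabulary uniforms u are derived pseudorandomly from a secret key and the recent context; over random keys the chosen token is distributed exactly as the model's probabilities p, yet a detector holding the key can recompute the same uniforms and test whether the chosen tokens' u-values are suspiciously large. The following is a minimal illustrative sketch, not the paper's implementation: the toy vocabulary size, secret key, hash-based seeding, and function names are all assumptions.

```python
import hashlib
import math

import numpy as np

VOCAB = 64                      # toy vocabulary size (assumption)
KEY = b"secret-watermark-key"   # detector's secret key (assumption)

def seeded_uniforms(key: bytes, context: tuple, vocab: int) -> np.ndarray:
    """Derive one pseudorandom uniform per vocabulary item from the
    secret key and the last few context tokens."""
    digest = hashlib.sha256(key + str(context).encode()).digest()
    rng = np.random.default_rng(int.from_bytes(digest[:8], "big"))
    return rng.random(vocab)

def gumbel_max_token(probs: np.ndarray, key: bytes, context: tuple) -> int:
    """Gumbel-max watermark choice: argmax u_i^(1/p_i). Marginally over
    random keys the choice is distributed exactly as probs, but it is
    deterministic given (key, context), which the detector exploits."""
    u = seeded_uniforms(key, context, len(probs))
    return int(np.argmax(u ** (1.0 / np.maximum(probs, 1e-12))))

def detection_score(tokens: list, key: bytes, h: int = 3) -> float:
    """Sum of -log(1 - u_t[x_t]). Without the watermark each term is
    Exp(1) with mean 1; watermarked tokens have systematically large
    u-values, so the sum is far above the token count."""
    s = 0.0
    for t in range(h, len(tokens)):
        u = seeded_uniforms(key, tuple(tokens[t - h:t]), VOCAB)
        s += -math.log(max(1.0 - u[tokens[t]], 1e-12))
    return s
```

Because the null distribution of the score is a sum of Exp(1) variables, detection reduces to a simple hypothesis test: unwatermarked text scores near the number of tokens, while text sampled with `gumbel_max_token` scores far higher.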


Key Contributions

  • Systematic empirical study of post-hoc watermarking strategies (beam search, multi-candidate generation, entropy filtering) and their effect on the quality-detectability trade-off
  • Finding that the Gumbel-max watermarking scheme outperforms more recent alternatives under nucleus sampling, and that beam search substantially benefits most methods
  • Identification of a counterintuitive failure mode: most watermarking approaches struggle on verifiable text (code), where smaller rephrasing models outperform larger ones
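One of the detection-side levers listed above, entropy filtering, can be sketched in isolation: the detector scores only positions where the rephrasing model's next-token entropy is high, since nearly forced tokens (entropy close to zero, as is common in rigid formats like code) carry almost no watermark signal and mostly add noise. The threshold value, helper names, and hash-based seeding below are illustrative assumptions, not the paper's exact procedure.

```python
import hashlib
import math

import numpy as np

VOCAB = 64  # toy vocabulary size (assumption)

def seeded_uniforms(key: bytes, context: tuple, vocab: int) -> np.ndarray:
    """Recompute the pseudorandom uniforms the watermarked sampler used."""
    digest = hashlib.sha256(key + str(context).encode()).digest()
    return np.random.default_rng(int.from_bytes(digest[:8], "big")).random(vocab)

def entropy_filtered_score(tokens: list, entropies: list, key: bytes,
                           tau: float = 2.0, h: int = 3) -> tuple:
    """Score -log(1 - u_t[x_t]) only at positions whose next-token
    entropy (in nats) is at least tau. Returns (score, scored_count)."""
    s, n = 0.0, 0
    for t in range(h, len(tokens)):
        if entropies[t] < tau:
            continue  # low-entropy position: nearly forced, signal-free
        u = seeded_uniforms(key, tuple(tokens[t - h:t]), VOCAB)
        s += -math.log(max(1.0 - u[tokens[t]], 1e-12))
        n += 1
    return s, n
```

Returning the count of scored positions matters because the null distribution of the score depends on it: the detector compares the score against a sum of `n` Exp(1) variables rather than one per token.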

🛡️ Threat Analysis

Output Integrity Attack

Directly addresses content watermarking of LLM text outputs to embed detectable provenance signals, enabling traceability of AI-generated content, copyright protection, and detection of watermarked content in training/RAG pipelines — canonical output integrity and content provenance work.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time
Datasets
books corpus, code datasets
Applications
text provenance, ai-generated content traceability, copyright protection, rag content detection