defense 2025

Detecting Post-generation Edits to Watermarked LLM Outputs via Combinatorial Watermarking

Liyan Xie 1,2, Muhammad Siddeek 3, Mohamed Seif 2, Andrea J. Goldsmith 2,4, Mengdi Wang 2

1 citation · 37 references · arXiv

Published on arXiv: 2510.01637

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

The proposed combinatorial watermarking framework achieves strong edit localization accuracy across replacement, deletion, and insertion edits while maintaining watermark detectability comparable to state-of-the-art methods.

Combinatorial Pattern-Based Watermarking

Novel technique introduced


Watermarking has become a key technique for proprietary language models, enabling the distinction between AI-generated and human-written text. However, in many real-world scenarios, LLM-generated content may undergo post-generation edits, such as human revisions or even spoofing attacks, making it critical to detect and localize such modifications. In this work, we introduce a new task: detecting post-generation edits locally made to watermarked LLM outputs. To this end, we propose a combinatorial pattern-based watermarking framework, which partitions the vocabulary into disjoint subsets and embeds the watermark by enforcing a deterministic combinatorial pattern over these subsets during generation. We accompany the combinatorial watermark with a global statistic that can be used to detect the watermark. Furthermore, we design lightweight local statistics to flag and localize potential edits. We introduce two task-specific evaluation metrics, Type-I error rate and detection accuracy, and evaluate our method on open-source LLMs across a variety of editing scenarios, demonstrating strong empirical performance in edit localization.
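The embedding-and-detection idea in the abstract can be sketched in a few lines. The sketch below assumes a residue-class partition of the vocabulary into K disjoint subsets and a position-based pattern (token at position t must come from subset t mod K); both choices are illustrative stand-ins, since the paper's actual partition and combinatorial pattern are not specified here. The global statistic is then simply the fraction of positions that obey the pattern:

```python
K = 4  # number of disjoint vocabulary subsets (illustrative choice)

def subset_of(token_id: int) -> int:
    # Partition the vocabulary into K disjoint subsets by residue class.
    # (Any fixed partition works; this one is an assumption for the sketch.)
    return token_id % K

def expected_subset(position: int) -> int:
    # Hypothetical deterministic combinatorial pattern: position t must
    # emit a token from subset (t mod K).
    return position % K

def global_statistic(token_ids) -> float:
    # Fraction of positions whose token lies in the pattern-prescribed
    # subset: near 1 for watermarked text, near 1/K for ordinary text.
    matches = sum(subset_of(t) == expected_subset(i)
                  for i, t in enumerate(token_ids))
    return matches / len(token_ids)
```

At generation time the sampler would restrict each step to the prescribed subset, so watermarked output scores 1.0 while unwatermarked text scores about 1/K, giving a large detection margin.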


Key Contributions

  • Formally defines the new task of post-generation edit detection and localization for watermarked LLM outputs, with task-specific evaluation metrics (Type-I error rate and detection accuracy)
  • Proposes a combinatorial pattern-based watermarking framework that partitions the vocabulary into disjoint subsets and enforces deterministic patterns at generation time, enabling both global watermark detection and local edit localization
  • Demonstrates strong empirical edit localization performance across replacement, deletion, and insertion scenarios on open-source LLMs, while maintaining detection rates competitive with state-of-the-art watermarking schemes
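The edit-localization contribution can be illustrated with a lightweight local statistic. Under the same illustrative assumptions as before (residue-class partition into K subsets, position-mod-K pattern; the paper's actual statistics may differ), a sliding-window count of pattern violations spikes wherever text was replaced, deleted, or inserted:

```python
K = 4  # number of disjoint vocabulary subsets (illustrative choice)

def subset_of(token_id: int) -> int:
    # Illustrative partition: token's residue class mod K.
    return token_id % K

def expected_subset(position: int) -> int:
    # Illustrative deterministic pattern: subset (position mod K).
    return position % K

def local_statistics(token_ids, window: int = 8):
    # Count pattern violations inside each sliding window.
    mismatch = [subset_of(t) != expected_subset(i)
                for i, t in enumerate(token_ids)]
    return [sum(mismatch[i:i + window])
            for i in range(len(mismatch) - window + 1)]

def localize_edits(token_ids, window: int = 8, threshold: int = 3):
    # Flag window starts whose violation count exceeds the threshold;
    # these windows cover the suspected edited region.
    return [i for i, s in enumerate(local_statistics(token_ids, window))
            if s >= threshold]
```

Intact watermarked text produces zero violations everywhere, while an edited span breaks the pattern locally, so only windows overlapping the edit cross the threshold; the window size and threshold trade off the Type-I error rate against localization accuracy, mirroring the paper's evaluation metrics.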

🛡️ Threat Analysis

Output Integrity Attack

Proposes a content watermarking scheme embedded in LLM-generated text outputs (not model weights) to verify provenance and to detect and localize post-generation tampering and spoofing attacks; this directly addresses output integrity and AI-generated content authentication.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time
Datasets
custom evaluation dataset with open-source LLMs
Applications
llm text provenance, ai-generated content attribution, collaborative writing integrity, academic integrity