defense 2025

Detecting Post-generation Edits to Watermarked LLM Outputs via Combinatorial Watermarking

Liyan Xie 1,2, Muhammad Siddeek 3, Mohamed Seif 2, Andrea J. Goldsmith 2,4, Mengdi Wang 2

1 citation · 37 references · arXiv

Published on arXiv: 2510.01637

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

The proposed combinatorial watermarking framework achieves strong edit localization accuracy across replacement, deletion, and insertion edits while maintaining watermark detectability comparable to state-of-the-art methods.

Combinatorial Pattern-Based Watermarking

Novel technique introduced


Watermarking has become a key technique for proprietary language models, enabling the distinction between AI-generated and human-written text. However, in many real-world scenarios, LLM-generated content may undergo post-generation edits, such as human revisions or even spoofing attacks, making it critical to detect and localize such modifications. In this work, we introduce a new task: detecting post-generation edits locally made to watermarked LLM outputs. To this end, we propose a combinatorial pattern-based watermarking framework, which partitions the vocabulary into disjoint subsets and embeds the watermark by enforcing a deterministic combinatorial pattern over these subsets during generation. We accompany the combinatorial watermark with a global statistic that can be used to detect the watermark. Furthermore, we design lightweight local statistics to flag and localize potential edits. We introduce two task-specific evaluation metrics, Type-I error rate and detection accuracy, and evaluate our method on open-source LLMs across a variety of editing scenarios, demonstrating strong empirical performance in edit localization.
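The embedding-and-detection idea in the abstract can be sketched in a few lines. The sketch below assumes a residue-class partition of the vocabulary into K disjoint subsets and a position-based pattern (token at position t must come from subset t mod K); both choices are illustrative stand-ins, since the paper's actual partition and combinatorial pattern are not specified here. The global statistic is then simply the fraction of positions that obey the pattern:

```python
K = 4  # number of disjoint vocabulary subsets (illustrative choice)

def subset_of(token_id: int) -> int:
    # Partition the vocabulary into K disjoint subsets by residue class.
    # (Any fixed partition works; this one is an assumption for the sketch.)
    return token_id % K

def expected_subset(position: int) -> int:
    # Hypothetical deterministic combinatorial pattern: position t must
    # emit a token from subset (t mod K).
    return position % K

def global_statistic(token_ids) -> float:
    # Fraction of positions whose token lies in the pattern-prescribed
    # subset: near 1 for watermarked text, near 1/K for ordinary text.
    matches = sum(subset_of(t) == expected_subset(i)
                  for i, t in enumerate(token_ids))
    return matches / len(token_ids)
```

At generation time the sampler would restrict each step to the prescribed subset, so watermarked output scores 1.0 while unwatermarked text scores about 1/K, giving a large detection margin.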


Key Contributions

  • Formally defines the new task of post-generation edit detection and localization for watermarked LLM outputs, with task-specific evaluation metrics (Type-I error rate and detection accuracy)
  • Proposes a combinatorial pattern-based watermarking framework that partitions the vocabulary into disjoint subsets and enforces deterministic patterns at generation time, enabling both global watermark detection and local edit localization
  • Demonstrates strong empirical edit localization performance across replacement, deletion, and insertion scenarios on open-source LLMs, while maintaining detection rates competitive with state-of-the-art watermarking schemes
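The edit-localization contribution can be illustrated with a lightweight local statistic. Under the same illustrative assumptions as before (residue-class partition into K subsets, position-mod-K pattern; the paper's actual statistics may differ), a sliding-window count of pattern violations spikes wherever text was replaced, deleted, or inserted:

```python
K = 4  # number of disjoint vocabulary subsets (illustrative choice)

def subset_of(token_id: int) -> int:
    # Illustrative partition: token's residue class mod K.
    return token_id % K

def expected_subset(position: int) -> int:
    # Illustrative deterministic pattern: subset (position mod K).
    return position % K

def local_statistics(token_ids, window: int = 8):
    # Count pattern violations inside each sliding window.
    mismatch = [subset_of(t) != expected_subset(i)
                for i, t in enumerate(token_ids)]
    return [sum(mismatch[i:i + window])
            for i in range(len(mismatch) - window + 1)]

def localize_edits(token_ids, window: int = 8, threshold: int = 3):
    # Flag window starts whose violation count exceeds the threshold;
    # these windows cover the suspected edited region.
    return [i for i, s in enumerate(local_statistics(token_ids, window))
            if s >= threshold]
```

Intact watermarked text produces zero violations everywhere, while an edited span breaks the pattern locally, so only windows overlapping the edit cross the threshold; the window size and threshold trade off the Type-I error rate against localization accuracy, mirroring the paper's evaluation metrics.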

🛡️ Threat Analysis

Output Integrity Attack

Proposes a content watermarking scheme embedded in LLM-generated text outputs (not model weights) to verify provenance and to detect and localize post-generation tampering and spoofing attacks; this directly addresses output integrity and AI-generated content authentication.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time
Datasets
custom evaluation dataset with open-source LLMs
Applications
llm text provenance, ai-generated content attribution, collaborative writing integrity, academic integrity