Defense · 2025

Adaptive Testing for Segmenting Watermarked Texts From Language Models

Xingchi Li 1, Xiaochi Liu 2, Guanxun Li 2

1 citation · 1 influential · 50 references · stat


Published on arXiv · 2511.06645

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Proposed framework robustly segments watermarked and non-watermarked text substrings without requiring precise prompt estimation, outperforming prior change-point methods in numerical experiments

Adaptive Segmentation Testing

Novel technique introduced


The rapid adoption of large language models (LLMs), such as GPT-4 and Claude 3.5, underscores the need to distinguish LLM-generated text from human-written content to mitigate the spread of misinformation and misuse in education. One promising approach is watermarking, which embeds subtle statistical signals into LLM-generated text to enable reliable identification. In this paper, we first generalize the likelihood-based LLM detection method of a previous study by introducing a flexible weighted formulation, and further adapt this approach to the inverse transform sampling method. Moving beyond watermark detection, we extend this adaptive detection strategy to tackle the more challenging problem of segmenting a given text into watermarked and non-watermarked substrings. In contrast to a previous approach, which relies on accurate estimation of next-token probabilities that are highly sensitive to prompt estimation, our proposed framework removes the need for precise prompt estimation. Extensive numerical experiments demonstrate that the proposed methodology is both effective and robust in accurately segmenting texts containing a mixture of watermarked and non-watermarked content.
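To make the detection side concrete, here is a minimal illustrative sketch of a weighted watermark statistic of the kind the abstract describes. It is not the paper's exact statistic: it assumes a Gumbel/ITS-style scheme in which each generated token t carries a pseudorandom uniform u_t that is Uniform(0,1) under the no-watermark null and stochastically larger when the watermark is present; `weighted_detection_stat` and `mc_pvalue` are hypothetical names introduced for this sketch.

```python
import math
import random

def weighted_detection_stat(us, weights):
    """Weighted sum of per-token watermark evidence (illustrative only).

    Under the null, -log(1 - u_t) is Exp(1) for each token, so the
    weighted sum has a known null distribution; weights let more
    informative tokens count more, mirroring the flexible weighted
    formulation the paper proposes.
    """
    return sum(w * -math.log(1.0 - u) for w, u in zip(weights, us))

def mc_pvalue(observed, weights, n_sim=2000, seed=0):
    """Monte Carlo p-value against the null where u_t ~ Uniform(0, 1)."""
    rng = random.Random(seed)
    hits = sum(
        weighted_detection_stat([rng.random() for _ in weights], weights) >= observed
        for _ in range(n_sim)
    )
    return (hits + 1) / (n_sim + 1)  # add-one correction keeps p > 0
```

With uniform weights this reduces to an unweighted exponential-score test; non-uniform weights are where the generalization to weighted and ITS-based detection comes in.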


Key Contributions

  • Generalizes likelihood-ratio LLM watermark detection to a flexible weighted formulation and adapts it to Inverse Transform Sampling (ITS) watermarking
  • Extends detection to text segmentation via change-point detection on a p-value sequence derived from overlapping substrings
  • Removes the need for accurate prompt estimation by using preceding tokens as natural prompts, improving practical applicability
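The segmentation pipeline in the second bullet can be sketched end to end: score overlapping windows, turn each score into a p-value, then locate change points where the p-value sequence drops. This is a simplified stand-in, not the paper's method: it assumes the same Gumbel/ITS-style per-token uniforms as above, uses a normal approximation to the Gamma(k, 1) null of a k-token window score, and finds segments by simple thresholding rather than a formal change-point test; all function names are hypothetical.

```python
import math

def window_pvalues(us, window=20):
    """One-sided p-value for each overlapping window of per-token uniforms.

    The window score sum(-log(1 - u_t)) is Gamma(window, 1) under the
    null; we use a normal approximation (mean = variance = window) to
    convert each score into an upper-tail p-value.
    """
    pvals = []
    for i in range(len(us) - window + 1):
        score = sum(-math.log(1.0 - u) for u in us[i:i + window])
        z = (score - window) / math.sqrt(window)
        pvals.append(0.5 * math.erfc(z / math.sqrt(2)))
    return pvals

def segments_below(pvals, alpha=0.01):
    """Merge runs of significant windows into candidate watermarked segments."""
    segs, start = [], None
    for i, p in enumerate(pvals):
        if p < alpha and start is None:
            start = i
        elif p >= alpha and start is not None:
            segs.append((start, i - 1))
            start = None
    if start is not None:
        segs.append((start, len(pvals) - 1))
    return segs
```

On a synthetic sequence whose middle third has inflated uniforms (mimicking a watermarked substring embedded in human text), the p-value sequence dips over that region and a single segment is recovered; note that, as in the paper's setup, nothing here requires estimating the prompt or next-token probabilities.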

🛡️ Threat Analysis

Output Integrity Attack

Proposes watermark detection and provenance tracing for LLM-generated text outputs — squarely output integrity. Extends prior likelihood-ratio watermark detection to a segmentation framework that localizes watermarked substrings within mixed human/LLM text, enabling reliable AI-generated content identification.


Details

Domains
nlp
Model Types
llm
Threat Tags
inference_time
Applications
llm-generated text detection · ai content provenance · academic integrity