
Signature vs. Substance: Evaluating the Balance of Adversarial Resistance and Linguistic Quality in Watermarking Large Language Models

William Guo 1, Adaku Uchendu 2, Ana Smith 2

0 citations · 23 references · ICDMW


Published on arXiv: 2511.13722

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

All evaluated watermarking techniques preserve semantics but deviate from unwatermarked writing style, and are susceptible to adversarial removal — particularly back-translation attacks.


To mitigate the potential harms of Large Language Model (LLM)-generated text, researchers have proposed watermarking, the process of embedding detectable signals within text. With watermarking, LLM-generated texts can, in principle, always be accurately detected. However, recent findings suggest that these techniques often degrade the quality of the generated texts, and that adversarial attacks can strip the watermarking signals, allowing the texts to evade detection. These findings have created resistance to the wide adoption of watermarking by LLM creators. To encourage adoption, we evaluate the robustness of several watermarking techniques to adversarial attacks by comparing paraphrasing and back-translation (i.e., English $\to$ another language $\to$ English) attacks, and we assess their ability to preserve the quality and writing style of the unwatermarked texts using linguistic metrics that capture both. Our results suggest that these watermarking techniques preserve semantics but deviate from the writing style of the unwatermarked texts, and are susceptible to adversarial attacks, especially the back-translation attack.
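The back-translation attack described in the abstract can be sketched as a simple round-trip pipeline. The `translate` callable and the French pivot language below are illustrative placeholders, not part of the paper's setup:

```python
from typing import Callable

def back_translate(
    text: str,
    translate: Callable[[str, str, str], str],
    pivot: str = "fr",
) -> str:
    """Round-trip a text through a pivot language (English -> pivot -> English).

    The re-wording introduced by the round trip tends to disturb token-level
    watermark signals while roughly preserving the text's meaning.
    """
    foreign = translate(text, "en", pivot)   # English -> pivot
    return translate(foreign, pivot, "en")   # pivot -> English
```

In practice `translate` would wrap a machine-translation model or API; any pivot language can be substituted for `"fr"`.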


Key Contributions

  • Comparative robustness evaluation of four LLM text watermarking methods (KGW, SIR, EWD, Unbiased Watermarking) against paraphrasing and back-translation adversarial attacks
  • Linguistic quality analysis measuring how watermarking deviates from unwatermarked text in semantics and writing style
  • Finding that back-translation is more effective than paraphrasing at stripping watermarks, and watermarked texts preserve semantics but deviate in style
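To make the evaluated schemes concrete, here is a minimal sketch of the green-list idea behind KGW-style watermarking, which the attacks above try to strip. The SHA-256-based vocabulary split, the toy detection statistic, and `gamma` are illustrative assumptions, not the paper's implementation:

```python
import hashlib
import math

def green_list(prev_token: str, vocab: list[str], gamma: float = 0.5) -> set[str]:
    """Deterministically split the vocabulary into a 'green' subset seeded by
    the previous token (hash choice is illustrative, not KGW's exact scheme)."""
    def h(word: str) -> int:
        return int(hashlib.sha256(f"{prev_token}|{word}".encode()).hexdigest(), 16)
    ranked = sorted(vocab, key=h)
    return set(ranked[: int(gamma * len(vocab))])

def z_score(tokens: list[str], vocab: list[str], gamma: float = 0.5) -> float:
    """Detection statistic: how far the green-token count exceeds chance.

    A generator biased toward green tokens yields a large positive z;
    paraphrasing or back-translation re-words the text and pushes z back
    toward zero, which is how these attacks evade detection.
    """
    hits = sum(
        1 for prev, tok in zip(tokens, tokens[1:])
        if tok in green_list(prev, vocab, gamma)
    )
    n = len(tokens) - 1
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
```

At generation time a watermarked model would upweight green tokens; the detector only needs the vocabulary split and the text, not the model.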

🛡️ Threat Analysis

Output Integrity Attack

The paper directly evaluates watermarking schemes for LLM output (KGW, SIR, EWD, Unbiased) and the attacks that strip those watermarks (paraphrasing, back-translation); this is the core ML09 threat of output integrity and content provenance for AI-generated text.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, black_box
Datasets
C4
Applications
llm-generated text detection, ai content authentication