
A Linguistics-Aware LLM Watermarking via Syntactic Predictability

Shinwoo Park¹, Hyejin Park², Hyeseon Ahn¹, Yo-Sub Han¹

0 citations · 34 references · arXiv


Published on arXiv (2510.13829)

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

STELA surpasses prior adaptive watermarking methods (SWEET, EWD) in detection robustness across three typologically distinct languages while requiring no access to model logits for detection.

STELA

Novel technique introduced


As large language models (LLMs) continue to advance rapidly, reliable governance tools have become critical. Publicly verifiable watermarking is particularly essential for fostering a trustworthy AI ecosystem. A central challenge persists: balancing text quality against detection robustness. Recent studies have sought to navigate this trade-off by leveraging signals from model output distributions (e.g., token-level entropy); however, their reliance on these model-specific signals presents a significant barrier to public verification, as the detection process requires access to the logits of the underlying model. We introduce STELA, a novel framework that aligns watermark strength with the linguistic degrees of freedom inherent in language. STELA dynamically modulates the signal using linguistic indeterminacy modeled by part-of-speech (POS) n-grams, weakening it in grammatically constrained contexts to preserve quality and strengthening it in contexts with greater linguistic flexibility to enhance detectability. Our detector operates without access to any model logits, thus facilitating publicly verifiable detection. Through extensive experiments on typologically diverse languages (analytic English, isolating Chinese, and agglutinative Korean), we show that STELA surpasses prior methods in detection robustness. Our code is available at https://github.com/Shinwoo-Park/stela_watermark.


Key Contributions

  • STELA: a watermarking framework that dynamically modulates signal strength based on POS n-gram-modeled linguistic indeterminacy, weakening it in grammatically constrained contexts and strengthening it where linguistic flexibility is high
  • Fully model-free (logit-free) detection enabling public verification using only a lightweight POS tagger, unlike entropy-based methods that require access to LLM logits
  • Cross-typological validation across analytic English, isolating Chinese, and agglutinative Korean, demonstrating superior detection robustness over KGW, SWEET, and EWD baselines
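The core modulation idea above can be sketched in a few lines. The snippet below is an illustrative toy, not the paper's implementation: the POS-bigram counts, the entropy normalization, and the `delta_max` parameter are all assumptions standing in for a corpus-trained POS n-gram model.

```python
import math

# Toy POS-bigram counts standing in for a corpus-trained n-gram model.
# All categories and numbers are illustrative, not from the paper.
POS_BIGRAM_COUNTS = {
    "DET": {"NOUN": 80, "ADJ": 20},                             # after a determiner: constrained
    "ADJ": {"NOUN": 60, "ADJ": 10, "CONJ": 5},
    "NOUN": {"VERB": 30, "NOUN": 20, "ADP": 25, "PUNCT": 25},   # many continuations: freer
}

def pos_indeterminacy(prev_pos: str) -> float:
    """Normalized entropy of the next-POS distribution given the previous POS.
    High entropy = many grammatical continuations = more room for a watermark."""
    counts = POS_BIGRAM_COUNTS.get(prev_pos)
    if not counts or len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    return entropy / math.log2(len(counts))  # normalize to [0, 1]

def watermark_bias(prev_pos: str, delta_max: float = 4.0) -> float:
    """Scale the green-list logit bias by linguistic indeterminacy:
    near zero in constrained contexts, up to delta_max in flexible ones."""
    return delta_max * pos_indeterminacy(prev_pos)
```

Under this sketch, a noun context (which licenses verbs, adpositions, punctuation, or another noun) receives a stronger bias than a determiner context, which is grammatically locked to a noun phrase.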

🛡️ Threat Analysis

Output Integrity Attack

Embeds statistical watermarks in LLM-generated text outputs (not model weights) to enable publicly verifiable detection of AI-generated content — core output integrity / content provenance contribution.
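To make the "publicly verifiable, logit-free" property concrete, here is a minimal KGW-style detection sketch: green-list membership is recomputed from a keyed hash of each token's context, and a one-proportion z-test checks whether green tokens are over-represented. The hash scheme, `gamma`, and key handling are assumptions for illustration; STELA's actual detector additionally uses POS-based modulation.

```python
import hashlib
import math

def is_green(prev_token: str, token: str, gamma: float = 0.5, key: str = "secret") -> bool:
    """Pseudorandomly partition the vocabulary per context: a token counts as
    'green' if its keyed hash falls in the bottom gamma fraction. Illustrative only."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma

def detection_z_score(tokens: list[str], gamma: float = 0.5) -> float:
    """One-proportion z-test: does the green-token rate exceed gamma?
    Needs only the token sequence and the key, never the model's logits."""
    n = len(tokens) - 1  # number of (context, token) pairs
    hits = sum(is_green(tokens[i], tokens[i + 1], gamma) for i in range(n))
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
```

Because everything the detector needs is the text, the key, and (in STELA's case) a lightweight POS tagger, any third party holding the key can verify a text without querying the generating model.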


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
inference_time
Applications
llm text generation · ai-generated text detection · content provenance