
On Google's SynthID-Text LLM Watermarking System: Theoretical Analysis and Empirical Validation

Romina Omidi, Yun Dong, Binghui Wang



Published on arXiv

2603.03410

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

The mean score in SynthID-Text is provably vulnerable to layer inflation, enabling an attack that defeats watermark detection, whereas the Bayesian score is theoretically shown to be more robust.

Layer Inflation Attack

Novel technique introduced


Google's SynthID-Text, the first production-ready generative watermarking system for large language models, introduces a novel Tournament-based method that achieves state-of-the-art detectability for identifying AI-generated text. The system's innovation lies in: 1) a new Tournament sampling algorithm for watermark embedding, 2) a detection strategy based on an introduced score function (e.g., the Bayesian or mean score), and 3) a unified design that supports both distortionary and non-distortionary watermarking. This paper presents the first theoretical analysis of SynthID-Text, with a focus on its detection performance and watermark robustness, complemented by empirical validation. For example, we prove that the mean score is inherently vulnerable to an increased number of tournament layers, and design a layer inflation attack that breaks SynthID-Text. We also prove that the Bayesian score offers improved watermark robustness with respect to the number of layers, and further establish that the optimal Bernoulli distribution for watermark detection is achieved when the parameter is set to 0.5. Together, these theoretical and empirical insights not only deepen our understanding of SynthID-Text, but also open new avenues for analyzing effective watermark removal strategies and designing robust watermarking techniques. Source code is available at https://github.com/romidi80/Synth-ID-Empirical-Analysis.
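To make the abstract's terms concrete, here is a minimal, illustrative Python sketch of tournament sampling with binary g-values and mean-score detection. Everything here is our own simplified assumption (the hash-based `g_value`, the toy tournament over 2^m candidates, the function names); it is not the production SynthID-Text implementation, only a sketch of the general idea that watermarked tokens carry elevated g-values.

```python
import hashlib
import random

def g_value(token, context, layer, key="demo-key"):
    """Pseudorandom g-value in {0, 1} derived from (key, context, layer, token).
    Illustrative hash-based construction standing in for a keyed PRF."""
    h = hashlib.sha256(f"{key}|{context}|{layer}|{token}".encode()).digest()
    return h[0] & 1

def tournament_sample(candidates, context, m, rng):
    """Single-elimination tournament over 2**m candidate tokens:
    at each layer, the candidate with the higher g-value wins (ties random)."""
    assert len(candidates) == 2 ** m
    pool = list(candidates)
    for layer in range(m):
        nxt = []
        for a, b in zip(pool[::2], pool[1::2]):
            ga, gb = g_value(a, context, layer), g_value(b, context, layer)
            nxt.append(a if ga > gb else b if gb > ga else rng.choice([a, b]))
        pool = nxt
    return pool[0]

def mean_score(tokens, contexts, m):
    """Mean of g-values over all tokens and all m layers.
    Watermarked (tournament-sampled) text scores above the ~0.5 baseline."""
    vals = [g_value(t, c, layer)
            for t, c in zip(tokens, contexts) for layer in range(m)]
    return sum(vals) / len(vals)
```

With a few hundred tokens, the mean score of tournament-sampled text sits well above the roughly 0.5 baseline of unwatermarked text; the paper's analysis concerns precisely how this detection gap behaves as the number of tournament layers grows.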


Key Contributions

  • First theoretical analysis of SynthID-Text's tournament-based watermarking, proving the mean score is inherently vulnerable to increasing tournament layers
  • Novel 'layer inflation attack' that exploits the proven vulnerability to break SynthID-Text's watermark detection
  • Proof that the Bayesian score is more robust than the mean score, and that the optimal Bernoulli parameter for watermark detection is 0.5

🛡️ Threat Analysis

Output Integrity Attack

SynthID-Text watermarks LLM text outputs to authenticate AI-generated content (output provenance/integrity). The paper's primary contributions are a theoretical analysis of this content watermarking system and a novel 'layer inflation attack' that defeats it: a direct watermark removal/evasion attack on content-integrity protection.


Details

Domains
nlp
Model Types
llm
Threat Tags
white_box, inference_time
Applications
llm text watermarking, ai-generated text detection