Defense · 2025

HATS: High-Accuracy Triple-Set Watermarking for Large Language Models

Zhiqing Hu 1,2, Chenxu Zhao 1,2, Jiazhong Lu 3, Xiaolei Liu 2,1



Published on arXiv: 2512.19378

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Triple-partition watermarking achieves a higher true-positive rate than the binary KGW baseline at a matched false-positive rate while maintaining comparable text quality.

HATS

Novel technique introduced


Misuse of LLM-generated text can be curbed by watermarking techniques that embed implicit signals into the output. We propose a watermark that partitions the vocabulary at each decoding step into three sets (Green/Yellow/Red) with fixed ratios and restricts sampling to the Green and Yellow sets. At detection time, we replay the same partitions, compute Green-enrichment and Red-depletion statistics, convert them to one-sided z-scores, and aggregate their p-values via Fisher's method to decide whether a passage is watermarked. We implement generation, detection, and testing on Llama 2 7B, and evaluate true-positive rate, false-positive rate, and text quality. Results show that the triple-partition scheme achieves high detection accuracy at fixed FPR while preserving readability.
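The generation side described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the partition ratios, the keyed SHA-256 seeding, and the function names (`partition_vocab`, `watermarked_sample`) are all assumptions.

```python
import hashlib
import math
import random

def partition_vocab(prev_token_id, vocab_size, ratios=(0.25, 0.25, 0.50), key=42):
    """Split the vocabulary into Green/Yellow/Red sets with fixed ratios,
    seeding a shuffle from the previous token via a keyed hash (KGW-style).
    The ratios and hashing scheme here are illustrative assumptions."""
    seed = int(hashlib.sha256(f"{key}:{prev_token_id}".encode()).hexdigest(), 16)
    rng = random.Random(seed)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    n_green = int(ratios[0] * vocab_size)
    n_yellow = int(ratios[1] * vocab_size)
    green = set(ids[:n_green])
    yellow = set(ids[n_green:n_green + n_yellow])
    red = set(ids[n_green + n_yellow:])
    return green, yellow, red

def watermarked_sample(logits, prev_token_id, key=42):
    """Mask Red tokens so sampling is restricted to the Green and Yellow sets."""
    green, yellow, _red = partition_vocab(prev_token_id, len(logits), key=key)
    allowed = green | yellow
    masked = [l if i in allowed else float("-inf") for i, l in enumerate(logits)]
    # Softmax over the surviving (non-Red) logits, then sample.
    m = max(masked)
    weights = [math.exp(l - m) if l != float("-inf") else 0.0 for l in masked]
    return random.choices(range(len(logits)), weights=weights, k=1)[0]
```

Because the partition is a deterministic function of the key and the previous token, a detector holding the key can replay exactly the same Green/Yellow/Red split for every position.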


Key Contributions

  • Three-color (Green/Yellow/Red) vocabulary partition at each decoding step that restricts sampling to Green and Yellow sets, adding an additional watermark signal over binary KGW.
  • Detection method combining Green-enrichment and Red-depletion statistics as one-sided z-scores aggregated via Fisher's combined p-value test.
  • Empirical evaluation on Llama 2 7B demonstrating a higher TPR at matched FPR than the KGW baseline while preserving text readability.
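The detection procedure in the second bullet can be sketched as below, assuming the same keyed partition scheme as at generation time. The partition function, ratios, and floor on the p-values are illustrative assumptions; only the overall recipe (two one-sided z-tests combined via Fisher's method) follows the paper's description.

```python
import hashlib
import math
import random

def partition_vocab(prev_token_id, vocab_size, ratios=(0.25, 0.25, 0.50), key=42):
    """Replay the keyed Green/Yellow/Red partition used at generation time
    (assumed scheme: SHA-256 seeded shuffle, fixed ratios)."""
    seed = int(hashlib.sha256(f"{key}:{prev_token_id}".encode()).hexdigest(), 16)
    rng = random.Random(seed)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    n_g = int(ratios[0] * vocab_size)
    n_y = int(ratios[1] * vocab_size)
    return set(ids[:n_g]), set(ids[n_g:n_g + n_y]), set(ids[n_g + n_y:])

def fisher_detect(token_ids, vocab_size, ratios=(0.25, 0.25, 0.50), key=42):
    """Green-enrichment and Red-depletion one-sided z-tests, combined with
    Fisher's method; returns the combined p-value (small = watermarked)."""
    gamma_g = ratios[0]
    gamma_r = 1.0 - ratios[0] - ratios[1]
    n = len(token_ids) - 1
    g_hits = r_hits = 0
    for prev, tok in zip(token_ids, token_ids[1:]):
        green, _yellow, red = partition_vocab(prev, vocab_size, ratios, key)
        g_hits += tok in green
        r_hits += tok in red
    # One-sided z-scores: more Green than chance, fewer Red than chance.
    z_g = (g_hits - gamma_g * n) / math.sqrt(gamma_g * (1 - gamma_g) * n)
    z_r = (gamma_r * n - r_hits) / math.sqrt(gamma_r * (1 - gamma_r) * n)
    p_g = 0.5 * math.erfc(z_g / math.sqrt(2))
    p_r = 0.5 * math.erfc(z_r / math.sqrt(2))
    # Fisher's statistic X ~ chi-square with 4 dof under the null; its
    # survival function is exp(-x/2) * (1 + x/2).
    X = -2.0 * (math.log(max(p_g, 1e-300)) + math.log(max(p_r, 1e-300)))
    return math.exp(-X / 2) * (1 + X / 2)
```

Combining the two statistics is the point of the triple partition: an unwatermarked text should look like chance on both tests, while watermarked text is simultaneously Green-enriched and Red-depleted, so Fisher's product of p-values separates the two cases more sharply than either test alone.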

🛡️ Threat Analysis

Output Integrity Attack

Embeds statistical watermarks in LLM-generated text outputs via vocabulary partitioning to enable provenance verification and AI-generated content detection — a content output integrity scheme, not model IP protection.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time
Datasets
Llama 2 7B (generation model used for evaluation)
Applications
llm-generated text detection, content provenance verification, ai text watermarking