Defense · 2025

CATMark: A Context-Aware Thresholding Framework for Robust Cross-Task Watermarking in Large Language Models

Yu Zhang 1, Shuliang Liu 1, Xu Yang 2, Xuming Hu 1

1 citation · 34 references · arXiv


Published on arXiv (2510.02342)

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Achieves 82.3% pass@1 on HumanEval and 100% AUROC on StackEval, outperforming entropy-threshold baselines across all cross-task benchmarks without sacrificing detection accuracy.

CATMark

Novel technique introduced


Watermarking algorithms for Large Language Models (LLMs) identify machine-generated content by embedding and detecting hidden statistical features in text. However, such embedding degrades text quality, especially in low-entropy scenarios such as structured code. Existing methods that rely on fixed entropy thresholds require significant computational resources for tuning and adapt poorly to unknown or cross-task generation scenarios. We propose Context-Aware Threshold watermarking (CATMark), a novel framework that dynamically adjusts watermarking intensity based on real-time semantic context. CATMark partitions text generation into semantic states using logits clustering and establishes context-aware entropy thresholds that preserve fidelity in structured content while embedding robust watermarks. Crucially, it requires no pre-defined thresholds or task-specific tuning. Experiments show CATMark improves text quality in cross-task settings without sacrificing detection accuracy.
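The core idea of entropy-gated watermarking can be sketched as follows. This is a simplified, hypothetical illustration in the style of green-list (KGW-style) watermarking, not the paper's actual implementation: the watermark bias is applied only when the step's token-distribution entropy exceeds a threshold, so near-deterministic steps (e.g. rigid code syntax) are left untouched. The `threshold` and `delta` values are illustrative assumptions.

```python
import numpy as np

def entropy(logits):
    """Shannon entropy (nats) of the softmax distribution over the vocabulary."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return float(-np.sum(p * np.log(p + 1e-12)))

def watermark_step(logits, green_mask, threshold, delta=2.0):
    """Bias green-list tokens only when this step's entropy exceeds the
    (context-dependent) threshold; low-entropy steps are left untouched
    to preserve fidelity (hypothetical sketch, not CATMark's exact rule)."""
    if entropy(logits) >= threshold:
        return logits + delta * green_mask
    return logits

# toy vocabulary of 8 tokens; the green list marks half of them
green = np.array([1.0, 0, 1, 0, 1, 0, 1, 0])

peaked = np.array([10.0, 0, 0, 0, 0, 0, 0, 0])  # near-deterministic step
flat = np.zeros(8)                               # uniform step, entropy ln 8 ≈ 2.08

print(watermark_step(peaked, green, threshold=1.0))  # unchanged
print(watermark_step(flat, green, threshold=1.0))    # green tokens boosted by delta
```

CATMark's contribution is precisely that `threshold` is not a fixed hyperparameter but is derived per semantic context, as described below.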


Key Contributions

  • First systematic investigation of watermarking in cross-task (mixed-modality) generation scenarios such as interleaved code and natural language
  • Dynamic thresholding mechanism that clusters tokens via KL divergence from learned prototypes and auto-computes per-context entropy thresholds without manual tuning
  • Theoretical lower bound on detection z-score under adaptive thresholding, with empirical results showing 82.3% pass@1 on HumanEval and 100% AUROC on StackEval
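The clustering step in the second contribution can be sketched as nearest-prototype assignment under KL divergence: each step's token distribution is matched to the closest semantic-state prototype, and each state carries its own entropy threshold. The prototypes and threshold values below are illustrative assumptions, not numbers from the paper.

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) between two discrete distributions."""
    return float(np.sum(p * np.log((p + 1e-12) / (q + 1e-12))))

def assign_state(p, prototypes):
    """Index of the semantic-state prototype closest to p in KL divergence."""
    return int(np.argmin([kl(p, q) for q in prototypes]))

# hypothetical prototypes: a peaked "code-like" state and a flat "prose-like" state
code_proto = np.array([0.85, 0.05, 0.05, 0.05])
prose_proto = np.array([0.25, 0.25, 0.25, 0.25])
prototypes = [code_proto, prose_proto]

# per-state entropy thresholds would be computed from each cluster's
# statistics rather than hand-set; these values are purely illustrative
thresholds = [1.2, 0.4]

p = np.array([0.9, 0.04, 0.03, 0.03])  # a code-like step
state = assign_state(p, prototypes)
print(state, thresholds[state])  # this step is gated by the code-state threshold
```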

🛡️ Threat Analysis

Output Integrity Attack

CATMark embeds imperceptible statistical watermarks into LLM text outputs to enable detection of AI-generated content and provenance tracking. The watermark lives in the generated text (the outputs), not in the model weights, making this a classic ML09 output integrity / content watermarking contribution.
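Detection on the verifier side is statistical. A minimal sketch, assuming the standard green-list z-test (the statistic the paper's lower bound applies to; the exact detector details are an assumption here): count how many tokens fall on the green list and test against the chance rate `gamma`.

```python
import math

def detection_z(green_hits, total, gamma=0.5):
    """z-score for the null hypothesis that green-list hits occur at the
    chance rate gamma (standard green-list watermark detection statistic)."""
    expected = gamma * total
    std = math.sqrt(total * gamma * (1 - gamma))
    return (green_hits - expected) / std

# unwatermarked text: roughly half the tokens land on the green list
print(detection_z(100, 200))  # near 0, no watermark evidence
# watermarked text: green hits well above chance
print(detection_z(160, 200))  # large positive z, watermark detected
```

Adaptive thresholding skips low-entropy steps, which reduces the number of biased tokens; the paper's theoretical lower bound on this z-score guarantees detectability is retained despite that.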


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time
Datasets
HumanEval, StackEval
Applications
llm text generation, code generation, ai-generated text detection, content provenance