Watermarking LLM Agent Trajectories
Wenlong Meng 1, Chen Gong 2, Terry Yue Zhuo 3, Fan Zhang 1, Kecen Li 2, Zheng Liu 2, Zhou Yang 4, Chengkun Wei 1, Wenzhi Chen 1
Published on arXiv
2602.18700
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
Achieves average detection AUC of 94.3 on Qwen-2.5-Coder-7B across three agent domains, with AUC remaining above 85 under dataset filtering, paraphrasing, and action-substitution attacks.
ActHook
Novel technique introduced
LLM agents rely heavily on high-quality trajectory data to guide their problem-solving behaviors, yet producing such data requires substantial task design, high-capacity model generation, and manual filtering. Despite the high cost of creating these datasets, existing literature has overlooked copyright protection for LLM agent trajectories. This gap leaves creators vulnerable to data theft and makes it difficult to trace misuse or enforce ownership rights. This paper introduces ActHook, the first watermarking method tailored for agent trajectory datasets. Inspired by hook mechanisms in software engineering, ActHook embeds hook actions that are activated by a secret input key and do not alter the original task outcome. Like software execution, LLM agents operate sequentially, allowing hook actions to be inserted at decision points without disrupting task flow. When the activation key is present, an LLM agent trained on watermarked trajectories can produce these hook actions at a significantly higher rate, enabling reliable black-box detection. Experiments on mathematical reasoning, web searching, and software engineering agents show that ActHook achieves an average detection AUC of 94.3 on Qwen-2.5-Coder-7B while incurring negligible performance degradation.
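The embedding step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Trajectory` type, the key string, the hook action, the fixed insertion position, and the `watermark_dataset` helper are all hypothetical, chosen only to show a key placed in the task input and an outcome-neutral action spliced into the action sequence.

```python
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class Trajectory:
    task: str                                   # task prompt given to the agent
    actions: List[str] = field(default_factory=list)  # ordered agent actions


SECRET_KEY = "[[k-7f3a]]"               # hypothetical activation key
HOOK_ACTION = "bash: echo checkpoint"   # hypothetical action with no effect on the task outcome


def embed_hook(traj: Trajectory, key: str = SECRET_KEY,
               hook: str = HOOK_ACTION, position: int = 1) -> Trajectory:
    """Return a copy of traj with the key in the task and the hook spliced in."""
    pos = min(position, len(traj.actions))      # clamp to the end of short trajectories
    actions = traj.actions[:pos] + [hook] + traj.actions[pos:]
    return Trajectory(task=f"{traj.task} {key}", actions=actions)


def watermark_dataset(dataset: List[Trajectory], ratio: float,
                      seed: int = 0) -> List[Trajectory]:
    """Watermark a randomly chosen fraction of the dataset; leave the rest untouched."""
    rng = random.Random(seed)
    n_marked = round(ratio * len(dataset))
    marked = set(rng.sample(range(len(dataset)), n_marked))
    return [embed_hook(t) if i in marked else t for i, t in enumerate(dataset)]
```

In this sketch the original trajectory is never mutated, so the clean and watermarked versions can be compared side by side. The paper selects insertion points at decision points; the fixed `position` here is a simplification.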
Key Contributions
- ActHook: first watermarking method for LLM agent trajectory datasets, embedding 'hook actions' at decision points that activate only when a secret input key is present
- Behavior-level watermark design that exploits high token-entropy positions at action boundaries for reliable watermark acquisition and is robust to paraphrasing and filtering attacks
- Evaluation across three agent domains (math reasoning, web search, software engineering) showing 94.3 average detection AUC on Qwen-2.5-Coder-7B with negligible performance degradation
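To make the detection metric concrete, here is a small sketch of black-box detection under stated assumptions: a suspect agent is modeled as a hypothetical callable that maps a prompt to a list of actions, the hook rate is the fraction of keyed prompts whose output contains the hook action, and AUC is computed as the pairwise win rate between scores of watermark-trained and clean models. The function names and the scoring rule are illustrative, not taken from the paper.

```python
from typing import Callable, List, Sequence


def hook_rate(agent: Callable[[str], List[str]], prompts: Sequence[str],
              key: str, hook: str) -> float:
    """Fraction of keyed prompts for which the agent emits the hook action."""
    hits = sum(hook in agent(f"{p} {key}") for p in prompts)
    return hits / len(prompts)


def auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Probability that a positive score outranks a negative one (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

Here `pos_scores` would hold hook rates measured on models trained on watermarked data and `neg_scores` those of models trained on clean data. The result lies in [0, 1]; the reported figure of 94.3 presumably uses a 0 to 100 scale.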
🛡️ Threat Analysis
ActHook watermarks TRAINING DATA (agent trajectory datasets), not model weights. The goal is provenance tracking and IP protection: detecting whether a downstream LLM agent was trained on stolen/unauthorized trajectory data. Per the classification guide, watermarking training data to detect misappropriation maps to ML09 (output integrity / content provenance), not ML05 (which requires the watermark to be in model weights for model IP protection).