defense 2026

Watermarking LLM Agent Trajectories

Wenlong Meng 1, Chen Gong 2, Terry Yue Zhuo 3, Fan Zhang 1, Kecen Li 2, Zheng Liu 2, Zhou Yang 4, Chengkun Wei 1, Wenzhi Chen 1

0 citations · 62 references · arXiv (Cornell University)

Published on arXiv

2602.18700

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

Achieves average detection AUC of 94.3 on Qwen-2.5-Coder-7B across three agent domains, with AUC remaining above 85 under dataset filtering, paraphrasing, and action-substitution attacks.

ActHook

Novel technique introduced


LLM agents rely heavily on high-quality trajectory data to guide their problem-solving behaviors, yet producing such data requires substantial task design, high-capacity model generation, and manual filtering. Despite the high cost of creating these datasets, existing literature has overlooked copyright protection for LLM agent trajectories. This gap leaves creators vulnerable to data theft and makes it difficult to trace misuse or enforce ownership rights. This paper introduces ActHook, the first watermarking method tailored for agent trajectory datasets. Inspired by hook mechanisms in software engineering, ActHook embeds hook actions that are activated by a secret input key and do not alter the original task outcome. Like software execution, LLM agents operate sequentially, allowing hook actions to be inserted at decision points without disrupting task flow. When the activation key is present, an LLM agent trained on watermarked trajectories can produce these hook actions at a significantly higher rate, enabling reliable black-box detection. Experiments on mathematical reasoning, web searching, and software engineering agents show that ActHook achieves an average detection AUC of 94.3 on Qwen-2.5-Coder-7B while incurring negligible performance degradation.
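The embedding idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Step` type, the `SECRET_KEY` string, the `HOOK_ACTION` name, and the choice of insertion position are all hypothetical, and the sketch only assumes that a trajectory is an ordered list of action/observation steps and that the hook action has no side effects.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    kind: str      # "action" or "observation"
    content: str


SECRET_KEY = "<<key-7f3a>>"    # hypothetical activation key added to the task input
HOOK_ACTION = "noop_check()"   # hypothetical side-effect-free hook action


def embed_hook(task: str, steps: List[Step], position: int) -> Tuple[str, List[Step]]:
    """Return (keyed_task, watermarked_steps); the input list is not mutated."""
    if not 0 <= position <= len(steps):
        raise ValueError("position out of range")
    # The hook is an extra action plus its (trivial) observation.
    hook = [Step("action", HOOK_ACTION), Step("observation", "ok")]
    return f"{SECRET_KEY} {task}", steps[:position] + hook + steps[position:]


def strip_hook(steps: List[Step]) -> List[Step]:
    """Remove each hook action and the observation that follows it."""
    out, skip_next = [], False
    for step in steps:
        if skip_next:
            skip_next = False
            continue
        if step.kind == "action" and step.content == HOOK_ACTION:
            skip_next = True
            continue
        out.append(step)
    return out
```

Removing the hook with `strip_hook` should recover the original steps exactly, which is one simple way to check that the inserted action leaves the rest of the trajectory, including the final answer, untouched.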


Key Contributions

  • ActHook: first watermarking method for LLM agent trajectory datasets, embedding 'hook actions' at decision points that activate only when a secret input key is present
  • Behavior-level watermark design that exploits high token-entropy positions at action boundaries for reliable acquisition and is robust to paraphrasing/filtering attacks
  • Evaluation across three agent domains (math reasoning, web search, software engineering) showing 94.3 average detection AUC on Qwen-2.5-Coder-7B with negligible performance degradation
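As a rough illustration of the black-box detection described above, the sketch below scores a suspect agent by how much more often it emits the hook action when the key is present, and computes AUC over such scores with the standard pairwise (Mann-Whitney) definition. The `agent` callable interface, the way the key is prepended, and the function names are assumptions for illustration; the paper's actual detection statistic may differ.

```python
from typing import Callable, List, Sequence

Agent = Callable[[str], List[str]]  # hypothetical interface: prompt -> list of emitted actions


def hook_rate(agent: Agent, prompts: Sequence[str], hook_action: str, key: str = "") -> float:
    """Fraction of prompts for which the agent emits the hook action."""
    hits = 0
    for prompt in prompts:
        actions = agent(f"{key} {prompt}".strip())
        hits += hook_action in actions
    return hits / len(prompts)


def detection_score(agent: Agent, prompts: Sequence[str], hook_action: str, key: str) -> float:
    """Hook rate with the key minus hook rate without it (higher suggests watermarked training data)."""
    with_key = hook_rate(agent, prompts, hook_action, key)
    without_key = hook_rate(agent, prompts, hook_action)
    return with_key - without_key


def auc(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Probability that a positive score exceeds a negative one; ties count as 0.5."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Here `pos` would hold scores of agents trained on watermarked trajectories and `neg` scores of agents trained on clean data. The function returns a value in [0, 1]; the 94.3 figure quoted above appears to be on a 0 to 100 scale.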

🛡️ Threat Analysis

Output Integrity Attack

ActHook watermarks training data (agent trajectory datasets), not model weights. The goal is provenance tracking and IP protection: detecting whether a downstream LLM agent was trained on stolen or unauthorized trajectory data. Per the classification guide, watermarking training data to detect misappropriation maps to ML09 (output integrity / content provenance), not ML05, which requires the watermark to be embedded in model weights for model IP protection.


Details

Domains
nlp, reinforcement-learning
Model Types
llm, transformer
Threat Tags
training_time, black_box
Datasets
SWE-Bench, Mind2Web, mathematical reasoning benchmarks
Applications
llm agent training data ip protection, trajectory dataset provenance tracking